schmeing / ReSeq

More realistic simulator for genomic DNA sequences from Illumina machines that achieves a similar k-mer spectrum as the original
MIT License
52 stars 3 forks source link

Error: The ordering of the reference sequences in the bam file is not identical to the reference file #21

Open yulijia opened 11 months ago

yulijia commented 11 months ago

Hi, I am using BWA to map my FASTQ file to the hg38 genome. However, when I run the 'stats' command, it consistently stops at the beginning and throws the error message mentioned in the issue title.

Here is the command I'm using:

$BWA mem -M -t $THREADS -R "@RG\tID:$RGID\tLB:$RGLB\tPL:$RGPL\tPU:$RGPU\tSM:$PREFIX" $GENOME_REFERENCE $OUT_DIR/fastp/$FQ1.trimed.fq.gz  $OUT_DIR/fastp/$FQ2.trimed.fq.gz |\
 $SAMTOOLS sort -@ $THREADS -m $MAX_MEM -O BAM -T $PREFIX -o $OUT_DIR/$PREFIX.sorted.bam -
reseq illuminaPE -j 2 -r ../../../reference/Homo_sapiens_assembly38.fasta -b normal.recal.bam --statsOnly

>>> 16-11-23 22:50:50: info:  Running ReSeq version 1.1 in illuminaPE mode
>>> 16-11-23 22:50:50: info:  Reading reference from ../../../reference/Homo_sapiens_assembly38.fasta
>>> 16-11-23 22:51:30: info:  Read in 3366 reference sequences.
>>> 16-11-23 22:51:33: info:  Reading mapping from normal.recal.bam
>>> 16-11-23 22:51:33: info:  Storing real data statistics in normal.recal.bam.reseq
>>> 16-11-23 22:51:33: info:  Tiles will be ignored and all statistics will be generated like all reads are from the same tile.
!!! bool reseq::DataStats::ReadBam(const char*, const char*, const char*, const std::string&, reseq::uintSeqLen, reseq::uintNumThreads, bool) [/opt/conda/conda-bld/reseq_1684135748409/work/reseq/DataStats.cpp:1166]: error: The ordering of the reference sequences in the bam file is not identical to the reference file
HLA-DRB1*16:02:01
HLA-DRB1*16:02:01   HLA00878 11005 bp

The error seems to be originating from the following part of the code:

https://github.com/schmeing/ReSeq/blob/053b8d1af635bf0019dc818b3eb0825023abd037/reseq/DataStats.cpp#L1165C1-L1166C1

Could you please help me understand how to resolve this issue?

Thanks in advance, Lijia

jo-mc commented 11 months ago

is $GENOME_REFERENCE the same as Homo_sapiens_assembly38.fasta ? i guess so,

then in reseq you are using normal.recal.bam (not sorted?)

is this the sorted bam output ? $PREFIX.sorted.bam .......you should be using this in reseq ????.sorted.bam. ??? = PREFIX

maybe something like: reseq illuminaPE -j 2 -r ../../../reference/Homo_sapiens_assembly38.fasta -b normal.recal.sorted.bam --statsOnly

(Note: I have not used reseq in a while)

yulijia commented 11 months ago

Hi @jo-mc ,

Thank you for your quick response. it is sorted bam file.

Is it possible that the string 'HLA-DRB1*16:02:01' including the asterisk (*), could cause an error when being compared?

jo-mc commented 11 months ago

No that should be ok, Check the RNAME order in the bam header against the reference contig names.

samtools view -H normal.recal.bam | less

HLA-DRB1*16:02:01 is the last conitg in Homo_sapiens_assembly38.fasta, it should be the last in the bam header. The 3366th entry (SN: in bam header). (list contig in reference with : " awk '{ if ($0 ~/^>/) { print $1 } }' Homo_sapiens_assembly38.fasta | less " )

yulijia commented 11 months ago

Hi @jo-mc ,

Thanks for your suggestion.

My bam file header includes all chromosome name in the fasta file, one HD line and two PG lines.

I am not sure if it is because the HD or PG header line. I don't know how to remove all PG lines, I have tried the command line, but i still have two PG lines in header.

/usr/bin/samtools view -h normal_fixed.bam | grep -v "^@PG" | sed "s/\tPG:Z:[^\t]*//" | samtools view -bo your_fixed.bam -

I have upload the bam header and the fasta chromosome names. bam_header.txt fasta_chrname.txt

The difference lines between the two files are (I only compared the difference between the second column in BAM header line):

0a1 VN:1.5 3366a3368,3369 ID:samtools ID:samtools.1

Please let me know if you have any further suggestions.

Thank you!

jo-mc commented 11 months ago

Hi, I ran a test with illumina reads aligned to Homo_sapiens_assembly38.fasta, generated a bam and ran reseq and I get the same error? Not sure why the code looks ok? and the order and names look right. Maybe need @schmeing to have a look?

jo-mc commented 11 months ago

If you use the "ucsc no alt grc38 reference" it looks like it is working, but I have not done a full test,

[GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz] (https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz)

Location: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/

The "analysis set" is a version of the genome prepared for next-gen sequencing read alignment. It contains no alternate sequences, no patches, fixes or haplotypes, only the main chromosomes.

yulijia commented 11 months ago

Hi @jo-mc ,

Thank you for doing a lot of tests! Based on your testing, it seems that the sequence of ALT, decoy and HLA genes makes it challenging for the software to simulate data. I'm not sure if other simulators also encounter the same issue. I have switched to using ART, and if there are any updates, I will post about this issue.

yulijia commented 11 months ago

Hi @jo-mc ,

I remapped my data with GCA_000001405.15_GRCh38_no_alt_analysis_set.fna. However, in the first step, I encounter the Segmentation fault error every time. I use the adapter tag to prevent the software from estimating the adapter sequence by itself. I have enough memory (>512GB) on my server.

Can you guide me on how to troubleshoot this issue?

Thanks.

(py3) [ylj@localhost]$ reseq illuminaPE -j 8 -r ../../../reference/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna -b normal.recal.bam --statsOnly --adapterFile TruSeq_single
>>> 28-11-23 10:43:01: info:  Running ReSeq version 1.1 in illuminaPE mode
>>> 28-11-23 10:43:01: info:  Reading reference from ../../../reference/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
>>> 28-11-23 10:43:42: info:  Read in 195 reference sequences.
>>> 28-11-23 10:43:48: info:  Reading mapping from normal.recal.bam
>>> 28-11-23 10:43:48: info:  Storing real data statistics in normal.recal.bam.reseq
>>> 28-11-23 10:43:48: info:  Tiles will be ignored and all statistics will be generated like all reads are from the same tile.
>>> 28-11-23 10:44:32: info:  Loading adapters from: /home/ylj/miniconda3/envs/py3/etc/reseq/adapters/TruSeq_single.fa
>>> 28-11-23 10:44:32: info:  Loading adapter combination matrix from: /home/ylj/miniconda3/envs/py3/etc/reseq/adapters/TruSeq_single.mat
>>> 28-11-23 10:44:32: info:  Starting PreRun
>>> 28-11-23 10:46:12: info:  Finished PreRun
Total number of reads: 77612531
Read length: 31 - 90
Read length on reference: 31 - 130
Quality: 1 - 30 ( 33-based encoding )
Mapping Quality: 0 - 60
Maximum InDel Length: 40
>>> 28-11-23 10:46:12: info:  Starting main read-in
>>> 28-11-23 10:46:46: info:  Read 5% of the reads.
Segmentation fault
jo-mc commented 11 months ago

I tried a few variations and still get errors. Maybe move to something else?

yulijia commented 11 months ago

Thanks! The BAM file I used above has undergone read recalibration with GATK. However, ReSeq cannot process data with recalibrated reads. Fortunately, I can run ReSeq successfully using directly mapped reads.