sunnyisgalaxy / moabs

A comprehensive, accurate and efficient solution for analysis of large scale base-resolution DNA methylation data
14 stars 10 forks source link

Segmentation fault (core dumped) #22

Open demis001 opened 4 years ago

demis001 commented 4 years ago

I have 256G RAM on the system. I don't know why it is giving the Segmentation fault (core dumped) error. The alignment is from bismark and properly sorted.

mcall -m CDAA158_merged_final_sorted_dedup.deduplicated.bam -m CDAA328_merged_final_sorted_dedup.deduplicated.bam -m CDAA591_merged_final_sorted_dedup.deduplicated.bam -m CDAA94_merged_final_sorted_dedup.deduplicated.bam -p 2 --sampleName CDAA -r ${GENODIR}/hg38_r87.fa --skipRandomChrom 1

lijinbio commented 4 years ago

Bismark may have been updated, and its output fields are not suitable for the latest MCALL in our MOABS. Please try to generate a BAM using BSMAP, which is also a component in MOABS. Please let us know if the error persists with a BSMAP BAM file.

demis001 commented 4 years ago

Got this one for BSMAP bam files

mcall -m ADAA760_sorted_dedup.bam 
Options are saved in file run.config and printed here:
cytosineMinScore=20
excludedFlag=0
fullMode=0
keepTemp=0
mappedFiles=ADAA760_sorted_dedup.bam 
minFragSize=0
minMMFragSize=0
nextBaseMinScore=3
processPEOverlapSeq=1
qualityScoreBase=0
reportCHX=X
reportCpX=G
requiredFlag=0
skipRandomChrom=1
statsOnly=0
threads=1
trimRRBSEndRepairSeq=2
trimWGBSEndRepairPE1Seq=3
trimWGBSEndRepairPE2Seq=3
Program started
From the extension of file ADAA760_sorted_dedup.bam, program is parsing file according to BAM foramt
XR and ZS fields are found in first 5 mapped alignment
For file ADAA760_sorted_dedup.bam, the quality score format is Sanger format based at 33!
Protocol and read length are detected as WGBS and 135 bases for file ADAA760_sorted_dedup.bam
For file ADAA760_sorted_dedup.bam the number of all reads is 97896312 and the number of mapped reads is 97896312
Start processing file ADAA760_sorted_dedup.bam on chrom 1
terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr
Aborted (core dumped)
 mcall -m ADCA1594_sorted_dedup.bam -m ADCA1665_sorted_dedup.bam -m ADCA1696_sorted_dedup.bam -m ADCA1705_sorted_dedup.bam -p 10 --sampleName ADCA -r /data1/genome/hg38_r87/hg38_r87.fa --skipRandomChrom 1 --fullMode 0
Options are saved in file run.config and printed here:
cytosineMinScore=20
excludedFlag=0
fullMode=0
keepTemp=0
mappedFiles=ADCA1594_sorted_dedup.bam ADCA1665_sorted_dedup.bam ADCA1696_sorted_dedup.bam ADCA1705_sorted_dedup.bam 
minFragSize=0
minMMFragSize=0
nextBaseMinScore=3
processPEOverlapSeq=1
qualityScoreBase=0
reference=/data1/genome/hg38_r87/hg38_r87.fa
reportCHX=X
reportCpX=G
requiredFlag=0
sampleName=ADCA
skipRandomChrom=1
statsOnly=0
threads=10
trimRRBSEndRepairSeq=2
trimWGBSEndRepairPE1Seq=3
trimWGBSEndRepairPE2Seq=3
Program started
From the extension of file ADCA1594_sorted_dedup.bam, program is parsing file according to BAM foramt
From the extension of file ADCA1665_sorted_dedup.bam, program is parsing file according to BAM foramt
From the extension of file ADCA1696_sorted_dedup.bam, program is parsing file according to BAM foramt
From the extension of file ADCA1705_sorted_dedup.bam, program is parsing file according to BAM foramt
XR and ZS fields are found in first 5 mapped alignment
XR and ZS fields are found in first 5 mapped alignment
XR and ZS fields are found in first 5 mapped alignment
XR and ZS fields are found in first 5 mapped alignment
For file ADCA1665_sorted_dedup.bam, the quality score format is Sanger format based at 33!
For file ADCA1696_sorted_dedup.bam, the quality score format is Sanger format based at 33!
For file ADCA1705_sorted_dedup.bam, the quality score format is Sanger format based at 33!
For file ADCA1594_sorted_dedup.bam, the quality score format is Sanger format based at 33!
Protocol and read length are detected as WGBS and 135 bases for file ADCA1705_sorted_dedup.bam
Protocol and read length are detected as WGBS and 135 bases for file ADCA1696_sorted_dedup.bam
Protocol and read length are detected as WGBS and 135 bases for file ADCA1665_sorted_dedup.bam
Protocol and read length are detected as WGBS and 135 bases for file ADCA1594_sorted_dedup.bam
For file ADCA1665_sorted_dedup.bam the number of all reads is 112369553 and the number of mapped reads is 112369553
For file ADCA1705_sorted_dedup.bam the number of all reads is 135269276 and the number of mapped reads is 135269276
For file ADCA1696_sorted_dedup.bam the number of all reads is 120941544 and the number of mapped reads is 120941544
For file ADCA1594_sorted_dedup.bam the number of all reads is 119670725 and the number of mapped reads is 119670725
Start processing file ADCA1594_sorted_dedup.bam on chrom 13
Start processing file ADCA1594_sorted_dedup.bam on chrom 17
Start processing file ADCA1594_sorted_dedup.bam on chrom 15
Start processing file ADCA1594_sorted_dedup.bam on chrom 12
Start processing file ADCA1594_sorted_dedup.bam on chrom 14
Start processing file ADCA1665_sorted_dedup.bam on chrom 17
Start processing file ADCA1594_sorted_dedup.bam on chrom 10
Start processing file ADCA1594_sorted_dedup.bam on chrom 16
Start processing file ADCA1665_sorted_dedup.bam on chrom 13
Start processing file ADCA1665_sorted_dedup.bam on chrom 15
Start processing file ADCA1665_sorted_dedup.bamStart processing file  on chrom ADCA1665_sorted_dedup.bam on chrom 10
Start processing file ADCA1594_sorted_dedup.bam on chrom 18
Start processing file ADCA1665_sorted_dedup.bam on chrom 12
18
Start processing file Start processing file ADCA1665_sorted_dedup.bam on chrom 11
Start processing file ADCA1696_sorted_dedup.bam on chrom 10
Start processing file ADCA1665_sorted_dedup.bam on chrom 14
ADCA1665_sorted_dedup.bam on chrom 16
Start processing file ADCA1594_sorted_dedup.bam on chrom 11
Start processing file ADCA1696_sorted_dedup.bam on chrom 17
Start processing file ADCA1696_sorted_dedup.bam on chrom 15
Start processing file ADCA1696_sorted_dedup.bam on chrom 11
Start processing file ADCA1696_sorted_dedup.bam on chrom 12
Start processing file ADCA1696_sorted_dedup.bam on chrom 16
Start processing file ADCA1696_sorted_dedup.bam on chrom 14
Start processing file ADCA1705_sorted_dedup.bam on chrom 10
Start processing file ADCA1696_sorted_dedup.bam on chrom 13
Start processing file ADCA1696_sorted_dedup.bam on chrom 18
Start processing file ADCA1705_sorted_dedup.bam on chrom 11
Start processing file ADCA1705_sorted_dedup.bam on chrom 15
Start processing file ADCA1705_sorted_dedup.bam on chrom 12
Start processing file ADCA1705_sorted_dedup.bam on chrom 17
Start processing file ADCA1705_sorted_dedup.bam on chrom 13
Start processing file ADCA1705_sorted_dedup.bam on chrom 16
Start processing file ADCA1705_sorted_dedup.bam on chrom 18
Start processing file ADCA1705_sorted_dedup.bam on chrom 14
Start processing file Start processing file Start processing file ADCA1696_sorted_dedup.bam on chrom ADCA1594_sorted_dedup.bam1 on chrom 
1
ADCA1705_sorted_dedup.bam on chrom 1
Start processing file ADCA1665_sorted_dedup.bam on chrom 1
terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr
Aborted (core dumped)
demis001 commented 4 years ago

Tried both conda version and Linux pre-compiled. Both gave the same error. An error shows as if something wrong with the for loop.

demis001 commented 4 years ago

conda version shows this at the end:

terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr: __pos (which is 9994) > this->size() (which is 0)
Aborted (core dumped)
lijinbio commented 4 years ago

Yes, the second calling with the reference FASTA is supposed to be correct. This error is likely related to the overflow of the reference genome. Can you ensure the genome sequence is the same as the one used in the BSMAP mapping?

demis001 commented 4 years ago

Yes it is. I used the same fasta file

On Tue, Feb 11, 2020 at 10:22 AM Jin Li notifications@github.com wrote:

Yes, the second calling with the reference FASTA is supposed to be correct. This error is likely related to the overflow of the reference genome. Can you ensure the genome sequence is the same as the one used in the BSMAP mapping?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/sunnyisgalaxy/moabs/issues/22?email_source=notifications&email_token=ACCPKKUAFGUI4WAAUTF6VGLRCK7EJA5CNFSM4KR4SO72YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOELM2GVY#issuecomment-584688471, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACCPKKUDF4KRQSDBMEFRQCDRCK7EJANCNFSM4KR4SO7Q .

demis001 commented 4 years ago
 GENODIR=/data1/genome/hg38_r87/hg38_r87.fa
 bsmap -a ${SAMPLE}_L001_R1_val_1.fq.gz -b ${SAMPLE}_L001_R2_val_2.fq.gz -d ${GENODIR} -o ${SAMPLE}.bam  -p 8 -L 135  -w 100  -v 10 -q 10 -R -V 1
lijinbio commented 4 years ago

conda version shows this at the end:

terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr: __pos (which is 9994) > this->size() (which is 0)
Aborted (core dumped)

This error is tried to extract the reference sequence of the mapped read. The chromosome position starts from 9994, but the length of its DNA sequence is 0, so the overflow occurs.

lijinbio commented 4 years ago
 GENODIR=/data1/genome/hg38_r87/hg38_r87.fa
 bsmap -a ${SAMPLE}_L001_R1_val_1.fq.gz -b ${SAMPLE}_L001_R2_val_2.fq.gz -d ${GENODIR} -o ${SAMPLE}.bam  -p 8 -L 135  -w 100  -v 10 -q 10 -R -V 1

Can you help also paste the log info of the BSMAP output?

demis001 commented 4 years ago
[bsmap] @Mon Feb 10 13:29:54 2020   loading reference file: /data1/genome/hg38_r87/hg38_r87.fa  (format: FASTA)
 [bsmap] @Mon Feb 10 13:30:27 2020   25 reference seqs loaded, total size 3088286401 bp. 33 secs passed
 [bsmap] @Mon Feb 10 13:32:26 2020   create seed table. 152 secs passed
 [bsmap] @Mon Feb 10 13:32:26 2020   Pair-end alignment(8 threads),
     Input read file #1: CDAA328_L001_R1_val_1.fq.gz     (format: gzipped FASTQ)
     Input read file #2: CDAA328_L001_R2_val_2.fq.gz     (format: gzipped FASTQ)
     Output file: CDAA328.bam     (format: SAM, automatically convert to BAM)
 [bsmap] @Mon Feb 10 15:12:24 2020   total read pairs: 53540205  total time consumed:  13369 secs
     aligned pairs: 45576561 (85.1%), unique pairs: 43311018 (80.9%), non-unique pairs: 2265543 (4.2%)
     unpaired read #1: 3887678 (7.3%), unique reads: 3438870 (6.4%), non-unique reads: 448808 (0.8%)
     unpaired read #2: 2637656 (4.9%), unique reads: 2153578 (4.0%), non-unique reads: 484078 (0.9%)
demis001 commented 4 years ago

Let me see if the sort failed

demis001 commented 4 years ago

Everything seems ok! to me, I don't know why it failed. I have used it without any problem three years ago without any problem. This isn't my first time. We published using it in 2015

demis001 commented 4 years ago

I run 4 task at the same time, did that create a problem? I split the number of samples in four files and loaded in bash for loop: As far I can see the alignment seems ok.

bsmap] @Sun Feb  9 19:32:25 2020    loading reference file: /data1/genome/hg38_r87/hg38_r87.fa  (format: FASTA)
[bsmap] @Sun Feb  9 19:32:54 2020   25 reference seqs loaded, total size 3088286401 bp. 29 secs passed
[bsmap] @Sun Feb  9 19:33:13 2020   loading reference file: /data1/genome/hg38_r87/hg38_r87.fa  (format: FASTA)
[bsmap] @Sun Feb  9 19:33:42 2020   25 reference seqs loaded, total size 3088286401 bp. 29 secs passed
[bsmap] @Sun Feb  9 19:33:50 2020   loading reference file: /data1/genome/hg38_r87/hg38_r87.fa  (format: FASTA)
[bsmap] @Sun Feb  9 19:34:20 2020   25 reference seqs loaded, total size 3088286401 bp. 30 secs passed
[bsmap] @Sun Feb  9 19:34:22 2020   loading reference file: /data1/genome/hg38_r87/hg38_r87.fa  (format: FASTA)
[bsmap] @Sun Feb  9 19:34:52 2020   25 reference seqs loaded, total size 3088286401 bp. 30 secs passed
[bsmap] @Sun Feb  9 19:34:57 2020   create seed table. 152 secs passed
[bsmap] @Sun Feb  9 19:34:57 2020   Pair-end alignment(8 threads),
    Input read file #1: ADAA1381_L001_R1_val_1.fq.gz    (format: gzipped FASTQ)
    Input read file #2: ADAA1381_L001_R2_val_2.fq.gz    (format: gzipped FASTQ)
    Output file: ADAA1381.bam    (format: SAM, automatically convert to BAM)
[bsmap] @Sun Feb  9 19:35:40 2020   create seed table. 147 secs passed
[bsmap] @Sun Feb  9 19:35:40 2020   Pair-end alignment(8 threads),
    Input read file #1: ADAA760_L001_R1_val_1.fq.gz     (format: gzipped FASTQ)
    Input read file #2: ADAA760_L001_R2_val_2.fq.gz     (format: gzipped FASTQ)
    Output file: ADAA760.bam     (format: SAM, automatically convert to BAM)
[bsmap] @Sun Feb  9 19:36:29 2020   create seed table. 159 secs passed
[bsmap] @Sun Feb  9 19:36:29 2020   Pair-end alignment(8 threads),
    Input read file #1: ADCA1705_L001_R1_val_1.fq.gz    (format: gzipped FASTQ)
    Input read file #2: ADCA1705_L001_R2_val_2.fq.gz    (format: gzipped FASTQ)
    Output file: ADCA1705.bam    (format: SAM, automatically convert to BAM)
[bsmap] @Sun Feb  9 19:37:11 2020   create seed table. 169 secs passed
[bsmap] @Sun Feb  9 19:37:11 2020   Pair-end alignment(8 threads),
    Input read file #1: CDAA94_L001_R1_val_1.fq.gz  (format: gzipped FASTQ)
    Input read file #2: CDAA94_L001_R2_val_2.fq.gz  (format: gzipped FASTQ)
    Output file: CDAA94.bam  (format: SAM, automatically convert to BAM)
[bsmap] @Sun Feb  9 23:59:47 2020   total read pairs: 55375703  total time consumed:  15994 secs
    aligned pairs: 47692697 (86.1%), unique pairs: 45199732 (81.6%), non-unique pairs: 2492965 (4.5%)
    unpaired read #1: 3966235 (7.2%), unique reads: 3486447 (6.3%), non-unique reads: 479788 (0.9%)
    unpaired read #2: 2743656 (5.0%), unique reads: 2149537 (3.9%), non-unique reads: 594119 (1.1%)
lijinbio commented 4 years ago

Let me see if the sort failed

So the BAM is well-sorted? The BSMAP output will include the sorting process info.

lijinbio commented 4 years ago

I run 4 task at the same time, did that create a problem? I split the number of samples in four files and loaded in bash for loop: As far I can see the alignment seems ok.

bsmap] @Sun Feb  9 19:32:25 2020  loading reference file: /data1/genome/hg38_r87/hg38_r87.fa  (format: FASTA)
[bsmap] @Sun Feb  9 19:32:54 2020     25 reference seqs loaded, total size 3088286401 bp. 29 secs passed
[bsmap] @Sun Feb  9 19:33:13 2020     loading reference file: /data1/genome/hg38_r87/hg38_r87.fa  (format: FASTA)
[bsmap] @Sun Feb  9 19:33:42 2020     25 reference seqs loaded, total size 3088286401 bp. 29 secs passed
[bsmap] @Sun Feb  9 19:33:50 2020     loading reference file: /data1/genome/hg38_r87/hg38_r87.fa  (format: FASTA)
[bsmap] @Sun Feb  9 19:34:20 2020     25 reference seqs loaded, total size 3088286401 bp. 30 secs passed
[bsmap] @Sun Feb  9 19:34:22 2020     loading reference file: /data1/genome/hg38_r87/hg38_r87.fa  (format: FASTA)
[bsmap] @Sun Feb  9 19:34:52 2020     25 reference seqs loaded, total size 3088286401 bp. 30 secs passed
[bsmap] @Sun Feb  9 19:34:57 2020     create seed table. 152 secs passed
[bsmap] @Sun Feb  9 19:34:57 2020     Pair-end alignment(8 threads),
  Input read file #1: ADAA1381_L001_R1_val_1.fq.gz    (format: gzipped FASTQ)
  Input read file #2: ADAA1381_L001_R2_val_2.fq.gz    (format: gzipped FASTQ)
  Output file: ADAA1381.bam    (format: SAM, automatically convert to BAM)
[bsmap] @Sun Feb  9 19:35:40 2020     create seed table. 147 secs passed
[bsmap] @Sun Feb  9 19:35:40 2020     Pair-end alignment(8 threads),
  Input read file #1: ADAA760_L001_R1_val_1.fq.gz     (format: gzipped FASTQ)
  Input read file #2: ADAA760_L001_R2_val_2.fq.gz     (format: gzipped FASTQ)
  Output file: ADAA760.bam     (format: SAM, automatically convert to BAM)
[bsmap] @Sun Feb  9 19:36:29 2020     create seed table. 159 secs passed
[bsmap] @Sun Feb  9 19:36:29 2020     Pair-end alignment(8 threads),
  Input read file #1: ADCA1705_L001_R1_val_1.fq.gz    (format: gzipped FASTQ)
  Input read file #2: ADCA1705_L001_R2_val_2.fq.gz    (format: gzipped FASTQ)
  Output file: ADCA1705.bam    (format: SAM, automatically convert to BAM)
[bsmap] @Sun Feb  9 19:37:11 2020     create seed table. 169 secs passed
[bsmap] @Sun Feb  9 19:37:11 2020     Pair-end alignment(8 threads),
  Input read file #1: CDAA94_L001_R1_val_1.fq.gz  (format: gzipped FASTQ)
  Input read file #2: CDAA94_L001_R2_val_2.fq.gz  (format: gzipped FASTQ)
  Output file: CDAA94.bam  (format: SAM, automatically convert to BAM)
[bsmap] @Sun Feb  9 23:59:47 2020     total read pairs: 55375703  total time consumed:  15994 secs
  aligned pairs: 47692697 (86.1%), unique pairs: 45199732 (81.6%), non-unique pairs: 2492965 (4.5%)
  unpaired read #1: 3966235 (7.2%), unique reads: 3486447 (6.3%), non-unique reads: 479788 (0.9%)
  unpaired read #2: 2743656 (5.0%), unique reads: 2149537 (3.9%), non-unique reads: 594119 (1.1%)

Bash for loop should be okay, but the BAM should be sorted, and it is automatically done by the latest BSMAP.

demis001 commented 4 years ago

Yes it sorted,

I run this way:

 for SAMPLE  in `cat ./task_samples_aa`
 do
     bsmap -a ${SAMPLE}_L001_R1_val_1.fq.gz -b ${SAMPLE}_L001_R2_val_2.fq.gz -d ${GENODIR} -o ${SAMPLE}.bam  -p 14 -L 135  -w 100  -v 10 -q 10 -R -V 1
     samtools sort -@ 28  ${SAMPLE}.bam -o  ${SAMPLE}_sorted.bam
     samtools index  ${SAMPLE}_sorted.bam
     java -jar  $PICARDHOME/picard.jar MarkDuplicates \
            I=${SAMPLE}_sorted.bam \
            O=${SAMPLE}_sorted_dedup.bam \
            M=${SAMPLE}_sorted_marked_dup_metrics.txt REMOVE_DUPLICATES=true
     samtools index ${SAMPLE}_sorted_dedup.bam

 done
demis001 commented 4 years ago

I never used the bsmap to sort, I used samtools in bash loop as shown above. It is seems sorted without any problem. I don't see any log file about the sort order in bsmap log file. The mcall don't show anything about the bam file except this...

> mcall -m ADCA1665_sorted_dedup.bam
Options are saved in file run.config and printed here:
cytosineMinScore=20
excludedFlag=0
fullMode=0
keepTemp=0
mappedFiles=ADCA1665_sorted_dedup.bam 
minFragSize=0
minMMFragSize=0
nextBaseMinScore=3
processPEOverlapSeq=1
qualityScoreBase=0
reportCHX=X
reportCpX=G
requiredFlag=0
skipRandomChrom=1
statsOnly=0
threads=1
trimRRBSEndRepairSeq=2
trimWGBSEndRepairPE1Seq=3
trimWGBSEndRepairPE2Seq=3
Program started
From the extension of file ADCA1665_sorted_dedup.bam, program is parsing file according to BAM foramt
XR and ZS fields are found in first 5 mapped alignment
For file ADCA1665_sorted_dedup.bam, the quality score format is Sanger format based at 33!
Protocol and read length are detected as WGBS and 135 bases for file ADCA1665_sorted_dedup.bam
lijinbio commented 4 years ago

I didn't see any error in your running. Let's go back to the MCALL error. To debug this error, can you help run MCALL on one BAM file with 1 thread (-p 1)? It is expected to terminate very soon, and I think it is still likely the input problem, either the BAM file or the reference genome. We can debug from the error in the specific chromosome.

lijinbio commented 4 years ago

Protocol and read length are detected as WGBS and 135 bases for file ADCA1665_sorted_dedup.bam

The latest BSMAP in Bionconda will automatically sort and index the BAM file. Can you help debug the Bioconda version? because we have continuously upgraded MOABS recently.

demis001 commented 4 years ago

I tried to use the bioconda version on bismark bam file using the bioconda mcall, got an error and then you suggested me to use the bsmap. In the mean time, I thought it may be a version problem and installed the Linux binary version and used for alignment. I am running the bioconda mcall on one file right now ..

demis001 commented 4 years ago

The same thing .

> mcall -m ADAA411_sorted.bam -p 1
Options are saved in file run.config and printed here:
cytosineMinScore=20
excludedFlag=0
fullMode=0
keepTemp=0
mappedFiles=ADAA411_sorted.bam 
minFragSize=0
minMMFragSize=0
nextBaseMinScore=3
processPEOverlapSeq=1
qualityScoreBase=0
reportCHX=X
reportCpX=G
requiredFlag=0
skipRandomChrom=1
statsOnly=0
threads=1
trimRRBSEndRepairSeq=2
trimWGBSEndRepairPE1Seq=3
trimWGBSEndRepairPE2Seq=3
Program started
From the extension of file ADAA411_sorted.bam, program is parsing file according to BAM foramt
XR and ZS fields are found in first 5 mapped alignment
For file ADAA411_sorted.bam, the quality score format is Sanger format based at 33!
Protocol and read length are detected as WGBS and 135 bases for file ADAA411_sorted.bam
For file ADAA411_sorted.bam the number of all reads is 91259772 and the number of mapped reads is 91259772
Start processing file ADAA411_sorted.bam on chrom 1
terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr: __pos (which is 9999) > this->size() (which is 0)
Aborted (core dumped)
demis001 commented 4 years ago
> which mcall
$HOME/anaconda3/bin/mcall
demis001 commented 4 years ago

The bam file is sorted:

> samtools view ADAA411_sorted_dedup.bam | more
A00257:298:HMW5GDMXX:1:1351:5891:22122  339 1   10000   255 73M =   10008   -65 ATATCCCTATCCCTATCCCTATCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC   ,FF,F,:FF,:,,,F,F,,FF,F,F::FF,:FF,:,FFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:   PG:Z:MarkDuplicates NM:i:4  XR:Z:aaATAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCta  ZS:Z:-+
A00257:298:HMW5GDMXX:1:2366:32298:23766 339 1   10000   255 95M =   10026   -69 ATATCACTACCGCTAACGCTAACCCTAACCTTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC ,F:,,,,FF,:,,:::F,,FF,,,
FFF:FF,FFFF,FFFFF,FFFFFF:,:FF,FFFFF,FFFFFFFF:FF:FFFFFFFFFFFFFFF:FFFFFFF PG:Z:MarkDuplicates NM:i:6  XR:Z:aaATAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACcc    ZS:Z:-+
A00257:298:HMW5GDMXX:2:1370:16071:36996 369 1   10000   255 100M    X   156030611   0   ATTACGCTTACCCTAACTCTTACGCTAACCCTTACTCTAACCCTACCCCTAACCCTAACCCTAACCATAACCCTAACCCTAACCCTAACCCTAACCCTAA    ,F,:,,,F
,,,,,::,,,FF,FF,,F,,F,FF,,,,FF:,F,FFF,F:FFFF:FFFF,FFFFFFFF,FF:FFFFFFFFFFF:FFFF:FFFF:FFFFFFFF    PG:Z:MarkDuplicates NM:i:10 XR:Z:aaATAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC
CCTAACCCTAAcc   ZS:Z:-+
A00257:287:HMHY3DMXX:1:2135:5195:6104   355 1   10001   255 97M =   10005   101 TAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTT   F:FFF:FF,FF:FF,,
:FFF:,FFFFFFFFFF:FFFFF:FFFFFFFFFFF,FFFFF,FF:FFF:FFF,:FFFFFFF,FFFFF:FF::FFFFFF:FFF   PG:Z:MarkDuplicates NM:i:0  XR:Z:aaTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT
aa  ZS:Z:++
A00257:287:HMHY3DMXX:1:2388:31195:2440  419 1   10001   255 84M =   10005   88  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC    FFFFFFFFFFFF:FFFFF:FFFFFFFFFFFFF
FFFF:FFFFFFFFFFFFFF,FFFFFFFFFFFFFFF:FF:::FFF:FFFF:FF    PG:Z:MarkDuplicates NM:i:0  XR:Z:aaTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCta   ZS:Z:--
A00257:298:HMW5GDMXX:1:1476:2763:26944  419 1   10001   255 60M =   10042   100 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC    F:FFFFFFFFFFFFFFFFFF,FFFFFFFFF:FFFFFFFFFFFF,,FFFF,FFFFF,
,FFF    PG:Z:MarkDuplicates NM:i:0  XR:Z:aaTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCta   ZS:Z:--
A00257:298:HMW5GDMXX:2:1387:12509:27383 355 1   10001   255 61M =   10271   331 TAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTT   FFFFFFFF,FFFFF::FFFFFFFFFF::FFFF:FFFF:FFFFFFFFFFFFFFFFFF
FFFFF   PG:Z:MarkDuplicates NM:i:0  XR:Z:aaTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTaa  ZS:Z:++
A00257:287:HMHY3DMXX:1:1450:28375:31219 99  1   10003   255 135M    =   10012   143 ATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTAA
TTTTAATTTTAATTT FFFFFF,FFFF,FFFFF::FFFFFFFFFFFF:FFFFFFFFF:FFFFF:FFFFFF,FFFF:,F:FF:,F,FFFFFFFFF:FFFF:,F:FF:FFFFF:,FFFF,,F,FFF,:FFFF,FFFF,F:FF,,FFFF,,F,F PG:Z:MarkDuplicates NM:i:4  XR:Z:taACCCTAACCCTAACCCTAACCCTAACCCTAACC
CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCta    ZS:Z:++
A00257:287:HMHY3DMXX:2:1425:8151:12602  355 1   10003   255 48M =   10037   82  ATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTA    FFFFFFFFFFFF:FF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF    PG:Z:Mar
kDuplicates NM:i:0  XR:Z:taACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAac   ZS:Z:++
A00257:287:HMHY3DMXX:2:1441:6198:31548  403 1   10003   255 94M =   10005   -92 ATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTT  FFFFFFF:FFFFFFFFFFFF::FF
F,::FFF::::FFFFFFFFFFFFFFFFF:FF:FFFFFFFFFFF,FFFFF:F:FFFF:FFFFFFFFFFFFF  PG:Z:MarkDuplicates NM:i:0  XR:Z:taACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCta ZS:Z:+-
A00257:287:HMHY3DMXX:2:1323:22209:32189 419 1   10004   255 58M =   10013   98  CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF::F:FFFFFFFF:FF,F:F:F,F,
,F  PG:Z:MarkDuplicates NM:i:0  XR:Z:aaCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTaa ZS:Z:--
lijinbio commented 4 years ago

The same thing .

> mcall -m ADAA411_sorted.bam -p 1
Options are saved in file run.config and printed here:
cytosineMinScore=20
excludedFlag=0
fullMode=0
keepTemp=0
mappedFiles=ADAA411_sorted.bam 
minFragSize=0
minMMFragSize=0
nextBaseMinScore=3
processPEOverlapSeq=1
qualityScoreBase=0
reportCHX=X
reportCpX=G
requiredFlag=0
skipRandomChrom=1
statsOnly=0
threads=1
trimRRBSEndRepairSeq=2
trimWGBSEndRepairPE1Seq=3
trimWGBSEndRepairPE2Seq=3
Program started
From the extension of file ADAA411_sorted.bam, program is parsing file according to BAM foramt
XR and ZS fields are found in first 5 mapped alignment
For file ADAA411_sorted.bam, the quality score format is Sanger format based at 33!
Protocol and read length are detected as WGBS and 135 bases for file ADAA411_sorted.bam
For file ADAA411_sorted.bam the number of all reads is 91259772 and the number of mapped reads is 91259772
Start processing file ADAA411_sorted.bam on chrom 1
terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr: __pos (which is 9999) > this->size() (which is 0)
Aborted (core dumped)

Can you specify the reference genome when running MCALL? This error still says chromosome 1 is empty.

lijinbio commented 4 years ago

I tried to use the bioconda version on bismark bam file using the bioconda mcall, got an error and then you suggested me to use the bsmap. In the mean time, I thought it may be a version problem and installed the Linux binary version and used for alignment. I am running the bioconda mcall on one file right now ..

Thanks. Yes, during version changes, Bismark BAM is not suitable in our MCALL anymore. Our latest MOABS is always deployed in Bioconda, and it is encouraging to use this Bioconda version. Thank you.

demis001 commented 4 years ago

i doubt that is the case. Why the mcall from the same version through the same error.

head /data1/genome/hg38_r87/hg38_r87.fa 1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

demis001 commented 4 years ago
> mcall -m ADAA411_sorted_dedup.bam -p 1 -g  /data1/genome/hg38_r87/hg38_r87.fa
Options are saved in file run.config and printed here:
cytosineMinScore=20
excludedFlag=0
fullMode=0
genome=/data1/genome/hg38_r87/hg38_r87.fa
keepTemp=0
mappedFiles=ADAA411_sorted_dedup.bam 
minFragSize=0
minMMFragSize=0
nextBaseMinScore=3
processPEOverlapSeq=1
qualityScoreBase=0
reportCHX=X
reportCpX=G
requiredFlag=0
skipRandomChrom=1
statsOnly=0
threads=1
trimRRBSEndRepairSeq=2
trimWGBSEndRepairPE1Seq=3
trimWGBSEndRepairPE2Seq=3
Program started
From the extension of file ADAA411_sorted_dedup.bam, program is parsing file according to BAM foramt
XR and ZS fields are found in first 5 mapped alignment
For file ADAA411_sorted_dedup.bam, the quality score format is Sanger format based at 33!
Protocol and read length are detected as WGBS and 135 bases for file ADAA411_sorted_dedup.bam
For file ADAA411_sorted_dedup.bam the number of all reads is 87334862 and the number of mapped reads is 87334862
Start processing file ADAA411_sorted_dedup.bam on chrom 1
terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr: __pos (which is 9999) > this->size() (which is 0)
Aborted (core dumped)
demis001 commented 4 years ago

Let me change to the linux version mcall

lijinbio commented 4 years ago

That is great. The bug is found. When reading the reference genome, we assume the header has no other description, i.e. >1 only, not >1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF. I am fixing the code on my side. The update should be seen in the next release in Bioconda. Thank you for your report on this error.

demis001 commented 4 years ago

Let me see if that fixes, I can replace the headers in the fasta file

lijinbio commented 4 years ago

Let me see if that fixes, I can replace the headers in the fasta file

Yes, please help do so. Thank you.

demis001 commented 4 years ago

sed 's/dna:chromosome.*$//g' hg38_r87.fa > hg38_r87_update.fa

demis001 commented 4 years ago
> sed 's/dna:chromosome.*$//g' hg38_r87.fa | more
>1 
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
lijinbio commented 4 years ago

sed 's/dna:chromosome.*$//g' hg38_r87.fa > hg38_r87_update.fa

Possible there is a space after >1, i.e. >1?

demis001 commented 4 years ago

Ok! let me remove that, failed with the current update

lijinbio commented 4 years ago

Ok! let me remove that, failed with the current update

Thanks.

demis001 commented 4 years ago

Still failed. Added a space by adding space before " dna" and checked in vim too. Nothing exist

grep ">" hg38_r87_update.fa 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y MT

demis001 commented 4 years ago
> grep ">"  hg38_r87_update.fa 
>1
>2
>3
>4
>5
>6
>7
>8
>9
>10
>11
>12
>13
>14
>15
>16
>17
>18
>19
>20
>21
>22
>X
>Y
>MT
demis001 commented 4 years ago

Still failed:

> mcall -m ADAA411_sorted_dedup.bam -p 1 -g  /data1/genome/hg38_r87/hg38_r87_update.fa 
Options are saved in file run.config and printed here:
cytosineMinScore=20
excludedFlag=0
fullMode=0
genome=/data1/genome/hg38_r87/hg38_r87_update.fa
keepTemp=0
mappedFiles=ADAA411_sorted_dedup.bam 
minFragSize=0
minMMFragSize=0
nextBaseMinScore=3
processPEOverlapSeq=1
qualityScoreBase=0
reportCHX=X
reportCpX=G
requiredFlag=0
skipRandomChrom=1
statsOnly=0
threads=1
trimRRBSEndRepairSeq=2
trimWGBSEndRepairPE1Seq=3
trimWGBSEndRepairPE2Seq=3
Program started
From the extension of file ADAA411_sorted_dedup.bam, program is parsing file according to BAM foramt
XR and ZS fields are found in first 5 mapped alignment
For file ADAA411_sorted_dedup.bam, the quality score format is Sanger format based at 33!
Protocol and read length are detected as WGBS and 135 bases for file ADAA411_sorted_dedup.bam
For file ADAA411_sorted_dedup.bam the number of all reads is 87334862 and the number of mapped reads is 87334862
Start processing file ADAA411_sorted_dedup.bam on chrom 1
terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr: __pos (which is 9999) > this->size() (which is 0)
Aborted (core dumped)
demis001 commented 4 years ago

The bam files are close to 10 GB, I remember I run with 200 GB 3 years, I thought that may not be a problem. I have tried the older version too and gave the samething

demis001 commented 4 years ago
> which mcall
$HOME/bin/moabs-1.3.8.7/bin/mcall
> mcall -m ADAA411_sorted_dedup.bam -p 1 -g  /data1/genome/hg38_r87/hg38_r87_update.fa
Options are saved in file run.config and printed here:
cytosineMinScore=20
excludedFlag=0
fullMode=0
genome=/data1/genome/hg38_r87/hg38_r87_update.fa
keepTemp=0
mappedFiles=ADAA411_sorted_dedup.bam 
minFragSize=0
minMMFragSize=0
nextBaseMinScore=3
processPEOverlapSeq=1
qualityScoreBase=0
reportCHX=X
reportCpX=G
requiredFlag=0
skipRandomChrom=1
statsOnly=0
threads=1
trimRRBSEndRepairSeq=2
trimWGBSEndRepairPE1Seq=3
trimWGBSEndRepairPE2Seq=3
Program started
From the extension of file ADAA411_sorted_dedup.bam, program is parsing file according to BAM foramt
XR and ZS fields are found in first 5 mapped alignment
For file ADAA411_sorted_dedup.bam, the quality score format is Sanger format based at 33!
Protocol and read length are detected as WGBS and 135 bases for file ADAA411_sorted_dedup.bam
For file ADAA411_sorted_dedup.bam the number of all reads is 87334862 and the number of mapped reads is 87334862
Start processing file ADAA411_sorted_dedup.bam on chrom 1
terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr
Aborted (core dumped)
lijinbio commented 4 years ago

That is sad. It somehow still can not read the reference correctly.

demis001 commented 4 years ago

Any idea why?

Do it need fasta index?

demis001 commented 4 years ago

Can I index the new fasta file

lijinbio commented 4 years ago

No need to index the FASTA file. The index is not used when reading a chromosome.

You mentioned you may transfer to the Linux platform. I doubt if it is the DOS format problem? Can you help run file xx.fa to see if its new line is CRLF?

demis001 commented 4 years ago

Found it, I think the you have changed the options -g and -r, I think -r will fix. Let me see if that works! Thanks for your help!!!

lijinbio commented 4 years ago

Really?

demis001 commented 4 years ago

I am stupid, it is running now. But the file still empty, passed the previous stage

demis001 commented 4 years ago

I will check with bismark bam too, I think, it will work.

demis001 commented 4 years ago
YEP!!! working

100M -rw-rw-r-- 1 ddjima fuse 100M Feb 11 12:41 ADAA411_sorted_dedup.bam.1.G.bed
4.0K -rw-rw-r-- 1 ddjima fuse   94 Feb 11 12:37 ADAA411_sorted_dedup.bam.1.HG.bed
8.0K -rw-rw-r-- 1 ddjima fuse 4.5K Feb 11 12:41 ADAA411_sorted_dedup.bam.1_stat.txt