jtamames / SqueezeMeta

A complete pipeline for metagenomic analysis
GNU General Public License v3.0
348 stars 81 forks source link

STEP10 -> 10.mapsamples.pl. #694

Closed kostovci closed 9 months ago

kostovci commented 1 year ago

Hi there, after trying running analysis in the newest version of squeezemeta (1.6.2) I'm getting repeatedly this error message: Stopping in STEP10 -> 10.mapsamples.pl. Program finished abnormally .................................... Calculating contig coverage Reading contig length from /media/storage2/Ips_metatrancriptome_0122/211213_Svec_Veselska/raw_fastq/squeeze_euk_reorder_40c12/intermediate/01.squeeze_euk_reorder_40c12.lon [E::sam_hrecs_update_hashes] Duplicate entry "megahit_1039620" in sam header samtools view: failed to add PG line to the header Illegal division by zero at /home/ubuntukrc/miniconda3/envs/SqueezeMeta_1.6.2/SqueezeMeta/scripts/10.mapsamples.pl line 444. Adding syslog file. The same analysis with the same dataset in 1.5.0 version didn't output any errors. Thansk in advance for any hints what is going on. Martin

fpusan commented 1 year ago

Hi! I don't think the syslog file got attached. Can you zip the file and attach again?

kostovci commented 1 year ago

Ohh sorry ...here we go.. syslog[1].gz

fpusan commented 1 year ago

So what happens if you search the allegedly duplicated contig in your fasta file? grep megahit_1039620 /media/storage2/Ips_metatrancriptome_0122/211213_Svec_Veselska/raw_fastq/squeeze_euk_reorder_40c12/results/01.squeeze_euk_reorder_40c12.fasta?

kostovci commented 1 year ago

It'll find one occassion : >megahit_1039620 singleton A00703:142:HWCF7DSX2:1:1101:13964:1689

megahit_1039620 singleton A00703:142:HWCF7DSX2:1:1101:4246:16955

fpusan commented 1 year ago

Let's check the bowtie index then

/home/ubuntukrc/miniconda3/envs/SqueezeMeta_1.6.2/SqueezeMeta/bin/bowtie2/bowtie2-inspect /media/storage2/Ips_metatrancriptome_0122/211213_Svec_Veselska/raw_fastq/squeeze_euk_reorder_40c12/squeeze_euk_reorder_40c12.bowtie -n | grep megahit_1039620

Anyways I see you are using the singletons mode. Not sure on whether that contig (which is a singleton, meaning that it really comes from an unassembled read) should be part of the reference in the first place...

@jtamames what do you think?

kostovci commented 1 year ago

Could not locate a Bowtie index corresponding to basename "/media/storage2/Ips_metatrancriptome_0122/211213_Svec_Veselska/raw_fastq/squeeze_euk_reorder_40c12/squeeze_euk_reorder_40c12.bowtie" Error: Encountered internal Bowtie 2 exception (#1) I have several bowtie indexes in here : /media/storage2/Ips_metatrancriptome_0122/211213_Svec_Veselska/raw_fastq/squeeze_euk_reorder_40c12/data

..named : squeeze_euk_reorder_40c12.bowtie.rev.1.bt2 ; squeeze_euk_reorder_40c12.bowtie.rev.2.bt2 ; squeeze_euk_reorder_40c12.bowtie.1.bt2 ; squeeze_euk_reorder_40c12.bowtie.2.bt2 ; squeeze_euk_reorder_40c12.bowtie.3.bt2 ; squeeze_euk_reorder_40c12.bowtie.4.bt2 ; I'm not sure as well concernign singletons usage but wanted to include all reads in analysis as some important are from low abundance species ;(

fpusan commented 1 year ago

Ah yeah, try /home/ubuntukrc/miniconda3/envs/SqueezeMeta_1.6.2/SqueezeMeta/bin/bowtie2/bowtie2-inspect /media/storage2/Ips_metatrancriptome_0122/211213_Svec_Veselska/raw_fastq/squeeze_euk_reorder_40c12/data/squeeze_euk_reorder_40c12.bowtie -n | grep megahit_103962 instead.

The --singletons mode is normally used for long, single end reads. In this case if you have paired end reads. Paired end reads that do not assemble will have the same name (they maybe differ at the end of the name string, but if it's after a space bowtie will ignore that). So that could account for the duplicated entries in the index. Still I think then they should also have shown as duplicated in your fasta file...

fpusan commented 1 year ago

Just in case, can you remove all the index files in /data (the ones ending in *.bt2) and run step 10 again?

jtamames commented 1 year ago

Hello! Letś check the SAM file. Please do:

/home/ubuntukrc/miniconda3/envs/SqueezeMeta_1.6.2/SqueezeMeta/bin/bowtie2/bowtie2 -x /media/storage2/Ips_metatrancriptome_0122/211213_Svec_Veselska/raw_fastq/squeeze_euk_reorder_40c12/data/squeeze_euk_reorder_40c12.bowtie -1 /media/storage2/Ips_metatrancriptome_0122/211213_Svec_Veselska/raw_fastq/squeeze_euk_reorder_40c12/temp/squeeze_euk_reorder_40c12.P1.current_1.gz -2 /media/storage2/Ips_metatrancriptome_0122/211213_Svec_Veselska/raw_fastq/squeeze_euk_reorder_40c12/temp/squeeze_euk_reorder_40c12.P1.current_2.gz --quiet -p 12 -S /media/storage2/Ips_metatrancriptome_0122/211213_Svec_Veselska/raw_fastq/squeeze_euk_reorder_40c12/data/bam/squeeze_euk_reorder_40c12.P1.sam --very-sensitive-local

and then:

grep megahit_103962 /media/storage2/Ips_metatrancriptome_0122/211213_Svec_Veselska/raw_fastq/squeeze_euk_reorder_40c12/data/bam/squeeze_euk_reorder_40c12.P1.sam

Best, J

kostovci commented 1 year ago

So this is the output from bowtie-inspect: megahit_103962 megahit_1039620 singleton A00703:142:HWCF7DSX2:1:1101:13964:1689 megahit_1039621 singleton A00703:142:HWCF7DSX2:1:1101:12057:3145 megahit_1039622 singleton A00703:142:HWCF7DSX2:1:1101:9489:4648 megahit_1039623 singleton A00703:142:HWCF7DSX2:1:1101:22643:5102 megahit_1039624 singleton A00703:142:HWCF7DSX2:1:1101:22652:6089 megahit_1039625 singleton A00703:142:HWCF7DSX2:1:1101:30770:6496 megahit_1039626 singleton A00703:142:HWCF7DSX2:1:1101:32560:7435 megahit_1039627 singleton A00703:142:HWCF7DSX2:1:1101:32163:7623 megahit_1039628 singleton A00703:142:HWCF7DSX2:1:1101:17309:8296 megahit_1039629 singleton A00703:142:HWCF7DSX2:1:1101:9579:8656 megahit_1039620 singleton A00703:142:HWCF7DSX2:1:1101:4246:16955 megahit_1039621 singleton A00703:142:HWCF7DSX2:1:1102:8621:14105 megahit_1039622 singleton A00703:142:HWCF7DSX2:1:1103:19135:20823 megahit_1039623 singleton A00703:142:HWCF7DSX2:1:1103:24569:24126 megahit_1039624 singleton A00703:142:HWCF7DSX2:1:1103:15456:35086 megahit_1039625 singleton A00703:142:HWCF7DSX2:1:1106:7536:25003 megahit_1039626 singleton A00703:142:HWCF7DSX2:1:1107:19542:14794 megahit_1039628 singleton A00703:142:HWCF7DSX2:1:1108:5493:29575 Those eight reads are probably paired-end reads but I'm not sure about naming convention of megahit. Still the biggest mystery is that it didn't happen in older version of squeezemeta with the same data and the same specs of the run.

kostovci commented 1 year ago

I'm going to check sam files..give me a second

jtamames commented 1 year ago

Version 1.5 worked with SAM files. In 1.6 we switched to use BAM files, so behavior can differ

kostovci commented 1 year ago

Just in case, can you remove all the index files in /data (the ones ending in *.bt2) and run step 10 again?

SO after deleting all index files i gives me: Can't find sample file /media/storage2/Ips_metatrancriptome_0122/211213_Svec_Veselska/raw_fastq/squeeze_euk_reorder_40c12/data/raw_fastq/Svec04_R1.fastq.gz for sample P2A in the samples file. Please check And that file really is not in the folder ...do not know why..

kostovci commented 1 year ago

Ouch, I deleted index files first ...so now bowtie don't have what to process ;(....I'll try on different analysis that terminated in step 10.

fpusan commented 1 year ago

So apparently there are two different reads in the index that start with the same megahit code megahit_1039620 singleton A00703:142:HWCF7DSX2:1:1101:13964:1689 megahit_1039620 singleton A00703:142:HWCF7DSX2:1:1101:4246:16955 Now that I check your previous message, it seems like both were in the fasta file, right??

It'll find one occassion : >megahit_1039620 singleton A00703:142:HWCF7DSX2:1:1101:13964:1689 megahit_1039620 singleton A00703:142:HWCF7DSX2:1:1101:4246:16955

kostovci commented 1 year ago

So apparently there are two different reads in the index that start with the same megahit code megahit_1039620 singleton A00703:142:HWCF7DSX2:1:1101:13964:1689 megahit_1039620 singleton A00703:142:HWCF7DSX2:1:1101:4246:16955 Now that I check your previous message, it seems like both were in the fasta file, right??

It'll find one occassion : >megahit_1039620 singleton A00703:142:HWCF7DSX2:1:1101:13964:1689 megahit_1039620 singleton A00703:142:HWCF7DSX2:1:1101:4246:16955

That's right.

kostovci commented 1 year ago

grep megahit_103962 /media/storage2/Ips_metatrancriptome_0122/211213_Svec_Veselska/raw_fastq/squeeze_euk_reorder_40c12/data/bam/squeeze_euk_reorder_40c12.P1.sam

So after runing abovementined commands on different run (as the indexes were deleted ) it gave me this oputput : @SQ SN:megahit_103962 LN:371 @SQ SN:megahit_1039620 LN:161 @SQ SN:megahit_1039621 LN:161 @SQ SN:megahit_1039622 LN:161 @SQ SN:megahit_1039623 LN:161 @SQ SN:megahit_1039624 LN:161 @SQ SN:megahit_1039625 LN:161 @SQ SN:megahit_1039626 LN:161 @SQ SN:megahit_1039627 LN:161 @SQ SN:megahit_1039628 LN:161 @SQ SN:megahit_1039629 LN:161

fpusan commented 1 year ago

Any chance you can share the /media/storage2/Ips_metatrancriptome_0122/211213_Svec_Veselska/raw_fastq/squeeze_euk_reorder_40c12/results/01.squeeze_euk_reorder_40c12.fasta file with us?

kostovci commented 1 year ago

No problem ..here it is https://www.myqnapcloud.com/smartshare/6c6f36min2opo66v8210v2z9_c82ie511m55723ssq2v27zbyyzf1e015

fpusan commented 1 year ago

Sorry for the late answer. I can confirm that there are "kinda duplicated" singleton reads in your fasta file, e.g.

>megahit_1162733 singleton A00703:142:HWCF7DSX2:3:2439:27552:1454
>megahit_1162733 singleton A00703:142:HWCF7DSX2:4:2648:6885:18803

I suspect this is causing a problem later when samtools only considers the string before the first space (e.g. megahit_1162733) when checking for duplicates.

I wonder if this is a byproduct of having restarted the project at some point during step one. Do you remember if this was the case? Can you try to rerun and see if the problem happens agian? @jtamames what do you think?

kostovci commented 1 year ago

Hi, thanks for checking. No for sure not - any of the scenarious you mentioned. It always happend after the first and only start of the analysis. I already tried different assembeler, RNAspades recetnly and with or without cleaning but always using singletons. Now I'm running analysis without singletons option so will see. Anyway I can still use the older version to get results ...but I'm just curious what is causing this problem in 1.6.2 . Thanks for you assisstance anyway.

fpusan commented 1 year ago

In 1.6.2 we are using samtools to compress the SAM results into the BAM format (and later read them to perform counting). The root of the issue was present before, but didn't lead to an explicit error since we didn't use samtools that way.

kostovci commented 1 year ago

So after rerunnig analysis without cleaning and singletons options it has finished.

fpusan commented 1 year ago

Glad to hear, that's something. We will look further into the issue with the singletons mode

kostovci commented 1 year ago

Thanks for helping with finding the source of ot issue and I'll be curious if the "singletons mystery" will be uncovered. Martin