jtamames / SqueezeMeta

A complete pipeline for metagenomic analysis
GNU General Public License v3.0

Binning Issues - Failure to generate outputs in intermediate/binners folder. #829

Closed MicroSeq closed 2 months ago

MicroSeq commented 2 months ago

Hello,

There appears to be a problem with the binning step. Judging by the syslog messages, it looks like the .bam files are not being properly passed to jgi_depths. All the files inside the binner folders are blank, despite there being a large bincontigs.fasta file in temp. I've run the same samples through another workflow and successfully recovered bins. The workflow otherwise completes to step 21. This is being executed on an HPC node running Ubuntu 20.04. Let me know if I can provide any other details, but I did not think it was quite the same as Issue 656, where it was a permissions problem. All of the files output in the binners folder are blank except for the DAS_tool log, which has:

GNU nano 4.8 intermediate/binners/DAS/BB_DASTool.log
scaffolds2bin file not found: -l Aborting.

syslog.zip
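
For anyone hitting the same symptom, a quick way to see exactly which binner inputs came out blank is to list the zero-byte files under the binners folder. The following is only a minimal sketch and not part of SqueezeMeta: the helper name `find_empty_files` is hypothetical, and the default path assumes the script is run from inside the project directory.

```python
from pathlib import Path


def find_empty_files(root):
    """Return the sorted paths of all zero-byte regular files under root."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.stat().st_size == 0
    )


if __name__ == "__main__":
    import sys

    # Default location is an assumption: run from inside the project directory.
    target = sys.argv[1] if len(sys.argv) > 1 else "intermediate/binners"
    for path in find_empty_files(target):
        print(f"EMPTY\t{path}")
```

If files such as the depth or abundance tables show up as empty while the BAM files exist, the problem is more likely in what the binners were given than in the mapping step.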

fpusan commented 2 months ago

What is the content of the /gpfs/fs7/grdi/genarcc/grdi_eco/groups/beaudettel/des003/STAGE_products/cleaned/BB/data/bam directory? Also are there results for the other binners you used (MaxBin and CONCOCT) in the /gpfs/fs7/grdi/genarcc/grdi_eco/groups/beaudettel/des003/STAGE_products/cleaned/BB/intermediate/binners directory? (there should be one directory per method with non-empty fasta files inside).

fpusan commented 2 months ago

Also what is the content of /gpfs/fs7/grdi/genarcc/grdi_eco/groups/beaudettel/des003/STAGE_products/cleaned/BB/data/00.BB.samples?

jtamames commented 2 months ago

And also it would be helpful to have the result of tree BB in the /gpfs/fs7/grdi/genarcc/grdi_eco/groups/beaudettel/des003/STAGE_products/cleaned directory

MicroSeq commented 2 months ago

/gpfs/fs7/grdi/genarcc/grdi_eco/groups/beaudettel/des003/STAGE_products/cleaned/BB/data/bam

BB.BB_rep1.bam BB.BB_rep1.bam.bai BB.BB_rep2.bam BB.BB_rep2.bam.bai BB.BB_rep3.bam BB.BB_rep3.bam.bai

/gpfs/fs7/grdi/genarcc/grdi_eco/groups/beaudettel/des003/STAGE_products/cleaned/BB/intermediate/binners

concoct: concoct_int coverage_table.tsv - coverage_table file is blank

maxbin:

abund.list - file is blank

metabat2:

contigs.depth.txt - file is blank

Also what is the content of /gpfs/fs7/grdi/genarcc/grdi_eco/groups/beaudettel/des003/STAGE_products/cleaned/BB/data/00.BB.samples

BB_rep1 BB_SolidProduct09_IncubatedandFiltered_ZymoBIOMICSKit_RNA_rep1_S28_cleaned_R1.fastq.gz pair1 noassembly nobinning
BB_rep1 BB_SolidProduct09_IncubatedandFiltered_ZymoBIOMICSKit_RNA_rep1_S28_cleaned_R2.fastq.gz pair2 noassembly nobinning
BB_rep2 BB_SolidProduct09_IncubatedandFiltered_ZymoBIOMICSKit_RNA_rep2_S29_cleaned_R1.fastq.gz pair1 noassembly nobinning
BB_rep2 BB_SolidProduct09_IncubatedandFiltered_ZymoBIOMICSKit_RNA_rep2_S29_cleaned_R2.fastq.gz pair2 noassembly nobinning
BB_rep3 BB_SolidProduct09_IncubatedandFiltered_ZymoBIOMICSKit_RNA_rep3_S30_cleaned_R1.fastq.gz pair1 noassembly nobinning
BB_rep3 BB_SolidProduct09_IncubatedandFiltered_ZymoBIOMICSKit_RNA_rep3_S30_cleaned_R2.fastq.gz pair2 noassembly nobinning
BB_rep1 BB_SolidProduct09_IncubatedandFiltered_ZymoBIOMICSKit_DNAPurified_rep1_S20_cleaned_R1.fastq.gz pair1
BB_rep1 BB_SolidProduct09_IncubatedandFiltered_ZymoBIOMICSKit_DNAPurified_rep1_S20_cleaned_R2.fastq.gz pair2
BB_rep2 BB_SolidProduct09_IncubatedandFiltered_ZymoBIOMICSKit_DNAPurified_rep2_S21_cleaned_R1.fastq.gz pair1
BB_rep2 BB_SolidProduct09_IncubatedandFiltered_ZymoBIOMICSKit_DNAPurified_rep2_S21_cleaned_R2.fastq.gz pair2
BB_rep3 BB_SolidProduct09_IncubatedandFiltered_ZymoBIOMICSKit_DNAPurified_rep3_S22_cleaned_R1.fastq.gz pair1
BB_rep3 BB_SolidProduct09_IncubatedandFiltered_ZymoBIOMICSKit_DNAPurified_rep3_S22_cleaned_R2.fastq.gz pair2

MicroSeq commented 2 months ago

And also it would be helpful to have the result of tree BB in the /gpfs/fs7/grdi/genarcc/grdi_eco/groups/beaudettel/des003/STAGE_products/cleaned directory

Not sure what you mean here? You want the files in this directory associated with BB?

des000@inter-eccc-ubuntu2004:/gpfs/fs7/grdi/genarcc/grdi_eco/groups/beaudettel/des003/STAGE_products/cleaned$ ls -l BB*
-rw-r--r-- 1 des000 grdi_eccc_beaudettel 1386 Apr 19 18:39 BB.tsv
-rw-r--r-- 1 des000 grdi_eccc_beaudettel 865116335 Apr 19 17:57 BB_SolidProduct09_IncubatedandFiltered_ZymoBIOMICSKit_DNAPurified_rep1_S20_cleaned_R1.fastq.gz
-rw-r--r-- 1 des000 grdi_eccc_beaudettel 893095552 Apr 19 17:57 BB_SolidProduct09_IncubatedandFiltered_ZymoBIOMICSKit_DNAPurified_rep1_S20_cleaned_R2.fastq.gz
-rw-r--r-- 1 des000 grdi_eccc_beaudettel 960187202 Apr 19 17:57 BB_SolidProduct09_IncubatedandFiltered_ZymoBIOMICSKit_DNAPurified_rep2_S21_cleaned_R1.fastq.gz
-rw-r--r-- 1 des000 grdi_eccc_beaudettel 996067207 Apr 19 17:57 BB_SolidProduct09_IncubatedandFiltered_ZymoBIOMICSKit_DNAPurified_rep2_S21_cleaned_R2.fastq.gz
-rw-r--r-- 1 des000 grdi_eccc_beaudettel 977063194 Apr 19 17:58 BB_SolidProduct09_IncubatedandFiltered_ZymoBIOMICSKit_DNAPurified_rep3_S22_cleaned_R1.fastq.gz
-rw-r--r-- 1 des000 grdi_eccc_beaudettel 1004650019 Apr 19 17:58 BB_SolidProduct09_IncubatedandFiltered_ZymoBIOMICSKit_DNAPurified_rep3_S22_cleaned_R2.fastq.gz
-rw-r--r-- 1 des000 grdi_eccc_beaudettel 589701526 Apr 19 18:11 BB_SolidProduct09_IncubatedandFiltered_ZymoBIOMICSKit_RNA_rep1_S28_cleaned_R1.fastq.gz
-rw-r--r-- 1 des000 grdi_eccc_beaudettel 585257032 Apr 19 18:11 BB_SolidProduct09_IncubatedandFiltered_ZymoBIOMICSKit_RNA_rep1_S28_cleaned_R2.fastq.gz
-rw-r--r-- 1 des000 grdi_eccc_beaudettel 753254871 Apr 19 18:11 BB_SolidProduct09_IncubatedandFiltered_ZymoBIOMICSKit_RNA_rep2_S29_cleaned_R1.fastq.gz
-rw-r--r-- 1 des000 grdi_eccc_beaudettel 756019035 Apr 19 18:11 BB_SolidProduct09_IncubatedandFiltered_ZymoBIOMICSKit_RNA_rep2_S29_cleaned_R2.fastq.gz
-rw-r--r-- 1 des000 grdi_eccc_beaudettel 874685726 Apr 19 18:12 BB_SolidProduct09_IncubatedandFiltered_ZymoBIOMICSKit_RNA_rep3_S30_cleaned_R1.fastq.gz
-rw-r--r-- 1 des000 grdi_eccc_beaudettel 866920464 Apr 19 18:12 BB_SolidProduct09_IncubatedandFiltered_ZymoBIOMICSKit_RNA_rep3_S30_cleaned_R2.fastq.gz

BB:
total 276
-rw-r--r-- 1 des000 grdi_eccc_beaudettel 8482 Apr 22 21:32 SqueezeMeta_conf.pl
-rw-r--r-- 1 des000 grdi_eccc_beaudettel 35 Apr 19 19:12 creator.txt
drwxr-sr-x 5 des000 grdi_eccc_beaudettel 4096 Apr 23 14:43 data
drwxr-sr-x 2 des000 grdi_eccc_beaudettel 4096 Apr 20 01:02 ext_tables
drwxr-sr-x 3 des000 grdi_eccc_beaudettel 4096 Apr 22 20:41 intermediate
-rw-r--r-- 1 des000 grdi_eccc_beaudettel 1651 Apr 22 21:41 methods.txt
-rw-r--r-- 1 des000 grdi_eccc_beaudettel 3166 Apr 19 19:12 parameters.pl
-rw-r--r-- 1 des000 grdi_eccc_beaudettel 314 Apr 20 01:04 progress
drwxr-sr-x 3 des000 grdi_eccc_beaudettel 4096 Apr 22 22:31 results
-rw-r--r-- 1 des000 grdi_eccc_beaudettel 65484 Apr 22 21:41 syslog
drwxr-sr-x 2 des000 grdi_eccc_beaudettel 16384 Apr 22 21:41 temp

I have been trimming the files with fastp outside of the workflow as there are some bugs associated with this in the workflow that cause issues with the restart function (it looks like the RNA reads are not kept after cleaning), plus copies of the raw reads are kept in the project dir I believe?

fpusan commented 2 months ago

I see what is happening here.

In your samples file there are two sets of files for the different samples BB_rep1, BB_rep2, BB_rep3.

This is fine in principle, SqueezeMeta will just concatenate the different libraries belonging to the same sample and work with that. So this is why you have one BAM file per sample. This is the intended behaviour.

However, you also specify the nobinning flag for the three samples in the first set of files (first 6 lines in your samples file). So SqueezeMeta is ignoring those samples during binning, which leaves no valid BAM files, hence the error you've been getting.

I see that these first three libraries correspond to RNA instead of DNA libraries, so it makes sense to add the noassembly and nobinning flags. But then you should not use the same sample names (since you will anyway want to distinguish later which reads came from the DNA library and which came from the RNA library).

So a samples file looking like this one (example shown for BB_rep1) should give you what you need.

BB_rep1_RNA BB_SolidProduct09_IncubatedandFiltered_ZymoBIOMICSKit_RNA_rep1_S28_cleaned_R1.fastq.gz pair1 noassembly nobinning
BB_rep1_RNA BB_SolidProduct09_IncubatedandFiltered_ZymoBIOMICSKit_RNA_rep1_S28_cleaned_R2.fastq.gz pair2 noassembly nobinning
BB_rep1_DNA BB_SolidProduct09_IncubatedandFiltered_ZymoBIOMICSKit_DNAPurified_rep1_S20_cleaned_R1.fastq.gz pair1
BB_rep1_DNA BB_SolidProduct09_IncubatedandFiltered_ZymoBIOMICSKit_DNAPurified_rep1_S20_cleaned_R2.fastq.gz pair2
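
The mistake described above (one sample name shared by flagged and unflagged libraries) can be caught before launching a run. Below is a rough sketch of such a check, not SqueezeMeta code: the function names are hypothetical, and it assumes whitespace-separated columns (sample, file, pair, optional flags) with no spaces inside file names.

```python
from collections import defaultdict

# Flags that appear in the samples file shown in this thread.
KNOWN_FLAGS = {"noassembly", "nobinning"}


def flags_per_sample(lines):
    """Map each sample name to the list of flag sets found on its lines."""
    seen = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        sample, _filename, _pair, *extra = fields
        seen[sample].append(frozenset(extra) & KNOWN_FLAGS)
    return seen


def conflicting_samples(lines):
    """Return sample names whose lines do not all carry the same flags."""
    per_sample = flags_per_sample(lines)
    return sorted(s for s, sets in per_sample.items() if len(set(sets)) > 1)


if __name__ == "__main__":
    # Shortened, made-up file names; only the sample names and flags matter.
    samples = [
        "BB_rep1 rna_R1.fastq.gz pair1 noassembly nobinning",
        "BB_rep1 rna_R2.fastq.gz pair2 noassembly nobinning",
        "BB_rep1 dna_R1.fastq.gz pair1",
        "BB_rep1 dna_R2.fastq.gz pair2",
    ]
    print(conflicting_samples(samples))  # prints ['BB_rep1']
```

With the renamed samples (BB_rep1_RNA and BB_rep1_DNA) the same check returns an empty list, since each name then carries one consistent set of flags.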

MicroSeq commented 2 months ago

Ahhh, silly misunderstanding on my end then. Thank you for clarifying! For some reason, I thought the additional flags would account for a separate RNA workflow for matched samples. This may also be what was causing issues with the -restart function when cleaning the samples, due to the lack of unique sample names.

fpusan commented 2 months ago

No problem, just added a patch that will detect this case and die with a meaningful error message if binning is attempted. Closing this now, let us know if you have any other issue!
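
The actual patch lives in SqueezeMeta itself and is not shown in this thread. Purely as an illustration of the idea (stop early with a clear message when no sample is left for binning), a guard along these lines could look as follows. The function name, the input structure and the message wording are all assumptions.

```python
def require_binnable(sample_flags):
    """Exit with an explanatory message if every sample is excluded from binning.

    sample_flags maps a sample name to the set of flags collected for it.
    Returns the sorted list of sample names that remain available for binning.
    """
    binnable = sorted(
        s for s, flags in sample_flags.items() if "nobinning" not in flags
    )
    if not binnable:
        raise SystemExit(
            "No samples are available for binning: every sample name carries "
            "the nobinning flag. Use separate sample names for libraries that "
            "should be binned."
        )
    return binnable
```

Called with `{"BB_rep1": {"noassembly", "nobinning"}}` this exits with the message, while `{"BB_rep1_RNA": {"noassembly", "nobinning"}, "BB_rep1_DNA": set()}` returns the DNA sample name.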