jtamames / SqueezeMeta

A complete pipeline for metagenomic analysis
GNU General Public License v3.0
365 stars 78 forks source link

issue with subsetBins in SQMtools #589

Closed nutrimol closed 1 year ago

nutrimol commented 1 year ago

Dear developers,

First of all, I apologize for this long long post.

I am facing some issues with subsetBins function in SQMtools (v1.6) that I do not understand. I run a Squeezemeta project with three metagenomic and three metatranscriptomic samples. I run the command in co-assembly mode but I supplied an external assembly and in samples.samples file I specified nobinning option for the metatranscriptomic samples. Now, in SQMtools, if I look for the coverage in a particular contig, I get this output:

> SQM$bins$cov['concoct.92.fa.contigs',]
    c40_Id0     c40_Ld3    c40_Ld11 l100_MTXbc1 l100_MTXbc2 l100_MTXbc3 
      9.941      42.426     320.272       3.069      31.195      20.603 

I am trying to explore the binning results in SQMtools. When I subset the SQM object to keep the high quality bins with subsetBins function, I miss the coverage values, but only in c40_Ld3 and C40_Ld11 samples (two metagenome samples). In the first metagenome sample (c40_Id0) and the three MTX samples (metatranscriptomes) the values are the same as in the SQM object:

> HiBinNames = rownames(SQM$bins$table)[SQM$bins$table$Completeness>90 & SQM$bins$table$Contamination<10]
> HiBins = subsetBins(SQM, HiBinNames)
> HiBins$bins$cov['concoct.92.fa.contigs',]
      c40_Id0       c40_Ld3      c40_Ld11   l100_MTXbc1   l100_MTXbc2   l100_MTXbc3 
 9.941221e+00 2.075076e-322 1.581010e-321  3.068889e+00  3.119514e+01  2.060262e+01 

And the same applies to $bases and $cpm values:

> HiBins$bins$bases['concoct.92.fa.contigs',]
      c40_Id0       c40_Ld3      c40_Ld11   l100_MTXbc1   l100_MTXbc2   l100_MTXbc3 
 6.336472e+07 1.336055e-315 1.008581e-314  1.956090e+07  1.988359e+08  1.313198e+08 
> HiBins$bins$cpm['concoct.92.fa.contigs',]
      c40_Id0       c40_Ld3      c40_Ld11   l100_MTXbc1   l100_MTXbc2   l100_MTXbc3 
 4.972524e-01 1.482197e-323 2.272702e-322  7.115670e-01  7.335347e+00  7.243278e+00

Interestingly, when I look for the contig in bins$table of the subset, the coverage values are there for all the samples.

> HiBins$bins$table['concoct.92.fa.contigs',4:22]
                       Length GC perc Num contigs Disparity Completeness Contamination Strain het Coverage c40_Id0 TPM c40_Id0 Coverage c40_Ld3 TPM c40_Ld3 Coverage c40_Ld11 TPM c40_Ld11 Coverage l100_MTXbc1 TPM l100_MTXbc1 Coverage l100_MTXbc2 TPM l100_MTXbc2 Coverage l100_MTXbc3 TPM l100_MTXbc3
concoct.92.fa.contigs 6373937   49.36          58         0         95.6          9.78         25            9.941    1162.671           42.426    6512.114           320.272      55869.1                3.069        8912.747               31.195        90856.66               20.603        89716.31

Unfortunately, the plotBins function seems to take the values from object$bin$cov and object$bin$cpm to plot the bins, so I cannot get good representations of these data unless I select "abund" or "percent" as the variable for plotting.

I am a newbie in R, and I am having a hard time with this. Please, could you help me to understand what is going on? probably I am messing things up... Another big unclear point for me is whether is it normal to get some output in the metatranscriptomic (MTX) samples considering that I specified nobinning in samples.samples file?

Thank you in advance for your time. All the best, Virginia

fpusan commented 1 year ago

Thanks for the detailed report. This indeed seems like a bug. For now, you can plot the bins of interest directly, without the need for running subsetBins.

HiBinNames = rownames(SQM$bins$table)[SQM$bins$table$Completeness>90 & SQM$bins$table$Contamination<10]
plotBins(SQM, bins=HiBinNames, count="cpm")

Any chance you could share your project with me to see if I can reproduce the bug? I don't need everything, just get inside the project directory and copy the SqueezeMeta_conf.pl file, and the results and intermediate directories.

nutrimol commented 1 year ago

Thank you Fernando. Of course I will share it with you. What is the best way to do that? I have already compressed them but still they are to heavy to upload them here..

fpusan commented 1 year ago

Wetransfer or google drive usually work

nutrimol commented 1 year ago

Dear Fernando, I sent you an email (fernando.puente.sanchez@slu.se) with the link to the data. I am not sure if you could access the data. Thank you, Virginia

fpusan commented 1 year ago

Yes, I'll look into it later

fpusan commented 1 year ago

Can reproduce

fpusan commented 1 year ago

Try to add the option engine="data.frame" when loading the project with loadSQM. This should get you the right counts after sub-setting.

nutrimol commented 1 year ago

Excellent! It works! Thank you so much!!!