Open mradz19 opened 1 month ago
Any chance this is happening when resuming a previously saved R session?
Does it persist after you run library(data.table); library(SQMtools)
?
Yes, I can reproduce that exact error by 1) loading a project in SQMtools
library(SQMtools)
SQM=loadSQM("mock1")
2) Saving the session and quitting R
save.image("foo.RData")
quit(save="no")
3) Restoring the session and trying to subset
library(SQMtools)
load("foo.RData")
x=subsetTax(SQM,"phylum","Pseudomonadota")
# Error in `[.data.frame`(x, , j, with = FALSE, drop = TRUE) :
# unused argument (with = FALSE)
4) Issue dissapears if I manually load data.table
library(data.table)
x=subsetTax(SQM, "phylum", "Pseudomonadota") # this works
The question is that in https://github.com/jtamames/SqueezeMeta/blob/edf13b13a8336f5952b2ee9cd896be816036f3c4/lib/SQMtools/R/genericTable.R#L55 we rely on data.table:::`[.data.table`
being part of our environment, but this does not happen automatically when loading SQMtools. Instead, it only seems to happen after running loadSQM
library(SQMtools)
sessionInfo()
# R version 4.3.2 (2023-10-31)
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
# other attached packages:
# [1] SQMtools_1.7.0
# loaded via a namespace (and not attached):
# [1] compiler_4.3.2
SQM = loadSQM("mock1")
sessionInfo()
# R version 4.3.2 (2023-10-31)
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
# other attached packages:
# [1] SQMtools_1.7.0
# loaded via a namespace (and not attached):
# [1] compiler_4.3.2 data.table_1.14.8
Yes this was using an RDS file that had been transferred between R sessions. Doing it in the original session I am having no issues.
However I have noticed that in my bin_table most of the bins have NA in the completeness, contamination and strain het columns. I have never seen that before, what would be causing that data to be missing?
Maybe some issue running checkm for those bins in particular?
Can you share the syslog
file here? that should include the output from checkm
Yep sure this is the syslog: syslog.txt
I processed my SQM object the following way:
bacteria=subsetTax(project, 'superkingdom', 'Bacteria', trusted_functions_only = T)
goodORFs = rownames(bacteria$orfs$tax)[bacteria$orfs$tax[,'superkingdom']=='Bacteria']
b.good = subsetORFs(bacteria, goodORFs, tax_source='orfs', trusted_functions_only = T, ignore_unclassified_functions = T)
bin_table=b.good$bins$table
It was in the table output that I noticed most of the bins had NAs for completeness and contamination. Which I haven't had in the past.
Here is an example of the table:
I suspect the bins with missing checkm info are coming from eukaryotes (which CheckM does not cover). Can you confirm by looking at the SqueezeMeta taxonomy in those bins?
I don't think that is the case as when I look at the bin_table.csv the tax for all of them is K_bacteria;...
I checked the actual .fa.contigs.fa.tax files as well and the consensus tax all seem to be bacteria, for example:
nano metabat2.1187.fa.contigs.fa.tax
List of taxa (abundance >= 1%): k: Bacteria (99.8);p: Proteobacteria (98.2);c: Betaproteobacteria (90.5) Alphaproteobacteria (1.8);o: Burkholderiales (60.2) Hyphomicrobiales (1.4);f: Comamonadaceae (29.5);g: Polaromonas (2.4) Limnohabitans (1.0);s: Betaproteobacteri>
SqueezeMeta tax: k_Bacteria;p_Proteobacteria;c_Betaproteobacteria;o_Burkholderiales
Consensus: k_Bacteria;p_Proteobacteria;c_Betaproteobacteria;o_Burkholderiales; Total size: 2686974 Disparity: 0.089 Consensus mode: s
However the completeness and contamination values for this bin are both NA
I did go ahead and do a separate install of checkM and ran it successfully on metabat2.1187.fa.contigs.fa:
checkm lineage_wf -t 8 bins/ test/
[2024-06-06 11:45:32] INFO: CheckM v1.2.2
[2024-06-06 11:45:32] INFO: checkm lineage_wf -t 8 bins/ test/
[2024-06-06 11:45:32] INFO: CheckM data: data/
[2024-06-06 11:45:32] INFO: [CheckM - tree] Placing bins in reference genome tree.
[2024-06-06 11:45:32] INFO: Identifying marker genes in 1 bins with 8 threads:
Finished processing 1 of 1 (100.00%) bins.
[2024-06-06 11:45:50] INFO: Saving HMM info to file.
[2024-06-06 11:45:50] INFO: Calculating genome statistics for 1 bins with 8 threads:
Finished processing 1 of 1 (100.00%) bins.
[2024-06-06 11:45:50] INFO: Extracting marker genes to align.
[2024-06-06 11:45:50] INFO: Parsing HMM hits to marker genes:
Finished parsing hits for 1 of 1 (100.00%) bins.
[2024-06-06 11:45:50] INFO: Extracting 43 HMMs with 8 threads:
Finished extracting 43 of 43 (100.00%) HMMs.
[2024-06-06 11:45:50] INFO: Aligning 43 marker genes with 8 threads:
Finished aligning 43 of 43 (100.00%) marker genes.
[2024-06-06 11:45:50] INFO: Reading marker alignment files.
[2024-06-06 11:45:50] INFO: Concatenating alignments.
[2024-06-06 11:45:50] INFO: Placing 1 bins into the genome tree with pplacer (be patient).
[2024-06-06 11:47:57] INFO: { Current stage: 0:02:25.649 || Total: 0:02:25.649 }
[2024-06-06 11:47:57] INFO: [CheckM - lineage_set] Inferring lineage-specific marker sets.
[2024-06-06 11:47:57] INFO: Reading HMM info from file.
[2024-06-06 11:47:57] INFO: Parsing HMM hits to marker genes:
Finished parsing hits for 1 of 1 (100.00%) bins.
[2024-06-06 11:47:57] INFO: Determining marker sets for each genome bin.
Finished processing 1 of 1 (100.00%) bins (current: metabat2.1187.fa.contigs).
[2024-06-06 11:47:58] INFO: Marker set written to: test/lineage.ms
[2024-06-06 11:47:58] INFO: { Current stage: 0:00:01.248 || Total: 0:02:26.898 }
[2024-06-06 11:47:58] INFO: [CheckM - analyze] Identifying marker genes in bins.
[2024-06-06 11:47:59] INFO: Identifying marker genes in 1 bins with 8 threads:
Finished processing 1 of 1 (100.00%) bins.
[2024-06-06 11:49:23] INFO: Saving HMM info to file.
[2024-06-06 11:49:23] INFO: { Current stage: 0:01:25.023 || Total: 0:03:51.921 }
[2024-06-06 11:49:23] INFO: Parsing HMM hits to marker genes:
Finished parsing hits for 1 of 1 (100.00%) bins.
[2024-06-06 11:49:24] INFO: Aligning marker genes with multiple hits in a single bin:
Finished processing 1 of 1 (100.00%) bins.
[2024-06-06 11:49:24] INFO: { Current stage: 0:00:00.853 || Total: 0:03:52.775 }
[2024-06-06 11:49:24] INFO: Calculating genome statistics for 1 bins with 8 threads:
Finished processing 1 of 1 (100.00%) bins.
[2024-06-06 11:49:25] INFO: { Current stage: 0:00:00.182 || Total: 0:03:52.958 }
[2024-06-06 11:49:25] INFO: [CheckM - qa] Tabulating genome statistics.
[2024-06-06 11:49:25] INFO: Calculating AAI between multi-copy marker genes.
[2024-06-06 11:49:25] INFO: Reading HMM info from file.
[2024-06-06 11:49:25] INFO: Parsing HMM hits to marker genes:
Finished parsing hits for 1 of 1 (100.00%) bins.
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Bin Id Marker lineage # genomes # markers # marker sets 0 1 2 3 4 5+ Completeness Contamination Strain heterogeneity
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
metabat2.1187.fa.contigs o__Burkholderiales (UID4000) 193 427 214 155 261 11 0 0 0 66.75 2.96 27.27
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
As you can see the contamination and completeness was calculated correctly
Yes, it seems that CheckM should have worked correctly for that one. My recommendation here would be to locate the contigs that are really from prokaryotes by loking at the bin table and running CheckM manually for those.
We are actually going to upgrade to CheckM2 for the next version of SqueezeMeta (which should come sometime in summer) and we have rewritten the corresponding SqueezeMeta scripts, so whatever is causing this bug will not be present (though of course new bugs might!).
Sorry another issue. I have loaded the project into R using loadSQM, but when trying to subset taxa I am getting an error which seems to be driven by an issue with the orf$table