jtamames / SqueezeMeta

A complete pipeline for metagenomic analysis
GNU General Public License v3.0
348 stars 81 forks source link

Help with subsetSQM() for a list of KEGG/PFAM IDs (or a list of genes) #689

Closed henslerk closed 1 year ago

henslerk commented 1 year ago

Hello, I tried to filter my original SQM object with subsetSQM <- subsetFun(sqm, fun = hmms) with the fun= being a vector hmms <- gen_list$hmms created from a text file with a list of hmms loaded with gen_list <- read.delim("gen_list.txt", header = TRUE, sep = "\t") with two columns (gene and hmms) (ie. PF12942|PF02461|TIGR03080; TIGR00836; PF18582; K00372|K00367; etc.). The object size decreased from 13.06GB to 5.89GB, so I had thought it was successful; however, once I looked into the subsetSQM$orfs$table I realized that the subsetted object still contains almost 10k unique genes. Do you know what has been filtered out for the object for it to decrease in size? Also, can you please advise on how to best use the subsetFun() to filter for a list of IDs/genes (17 to be exact)?

henslerk commented 1 year ago

I am going to try quickly with the genes vector and a specification for the column "Gene names" in the orfs$table

fpusan commented 1 year ago

hmms should be a string containing a regular expression. So you would need to have a single string, in which the different search terms are joined by the | delimiter, rather than a vector... E.g. subsetFun(SQM, 'term1|term2|term3') You can also add the parameter columns="PFAM" to restrict the search to that column

henslerk commented 1 year ago

Okay, thank you. I tried doing that without the columns specified and it returned a gene for every hmms minus those that are TIGRFAM.. I then tried a filter with the specific gene names and it returned more than the specified genes. With the columns="PFAM", only 1/17 of the genes are returned

fpusan commented 1 year ago

You can run unique(subsetSQM$orfs$table$PFAM) to make sure that you only got the PFAMs that you wanted.

henslerk commented 1 year ago

This is my string: PF12942|PF02461|TIGR03080|TIGR00836|PF18582|K00372|K00367|TIGR01287|K00368|K15864|TIGR04244|TIGR04246|K15576|TIGR01580|TIGR01792|K01999|K00053|K11073|TIGR02712|TIGR01729|TIGR01096 , however with that unique command, only 4 PFAMs are returned, with only 1 actually being from my string. When I ran the string with the gene names, all of my desired genes were present and more, so I know they are there.. perhaps not labeled with TIGR? I am not sure

fpusan commented 1 year ago

but have you used columns="PFAM" for that?

fpusan commented 1 year ago

I see that your search string is mixing also KEGG IDs, so maybe you could make use columns = c("KEGG ID", "PFAM"). Also I don't think we have TIGRfams in the annotation, so you will need to find other search terms for those families.

henslerk commented 1 year ago

but have you used columns="PFAM" for that?

When I used that specification only 1/3 of the PFAMs in my string were in the unique(subsetSQM$orfs$table$PFAM) in addition to 3 distinct PFAMS that were not in my string. Without the columns' specifications, I get 9/17, with the other 8 being the TIGRfams.

fpusan commented 1 year ago

You shouldn't be getting PFAMs not in your string if you are using columns = 'PFAMs'. Can you post here the exact command you are using and the list of PFAMs that this is returning?

henslerk commented 1 year ago

My apologies, with `> subsetSQM <- subsetFun(sqm, fun = 'PF12942|PF02461|TIGR03080|TIGR00836|PF18582|K00372|K00367|TIGR01287|K00368|K15864|TIGR04244|TIGR04246|K15576|TIGR01580|TIGR01792|K01999|K00053|K11073|TIGR02712|TIGR01729|TIGR01096', columns="PFAM")

unique(subsetSQM$orfs$table$PFAM) [1] "PF02461 [Ammonia monooxygenase]"` ... Only the one out of 3

fpusan commented 1 year ago

Can it be that that is the only PFAM from your list present in the data? What if you run subsetFun individually for each PFAM? (with the columns='PFAM' bit)

jtamames commented 1 year ago

Also I don't think we have TIGRfams in the annotation, so you will need to find other search terms for those families.

True. TIGRfams should indeed be present because we included them within the Pfam database, but.. I forgot to upload them in our database. My bad.