Closed henslerk closed 1 year ago
I am going to try quickly with the genes vector and a specification for the column "Gene names" in the orfs$table
hmms
should be a string containing a regular expression. So you would need to have a single string, in which the different search terms are joined by the |
delimiter, rather than a vector...
E.g. subsetFun(SQM, 'term1|term2|term3')
You can also add the parameter columns="PFAM"
to restrict the search to that column
Okay, thank you. I tried doing that without the columns specified and it returned a gene for every hmms minus those that are TIGRFAM.. I then tried a filter with the specific gene names and it returned more than the specified genes. With the columns="PFAM", only 1/17 of the genes are returned
You can run unique(subsetSQM$orfs$table$PFAM)
to make sure that you only got the PFAMs that you wanted.
This is my string: PF12942|PF02461|TIGR03080|TIGR00836|PF18582|K00372|K00367|TIGR01287|K00368|K15864|TIGR04244|TIGR04246|K15576|TIGR01580|TIGR01792|K01999|K00053|K11073|TIGR02712|TIGR01729|TIGR01096
, however with that unique command, only 4 PFAMs are returned, with only 1 actually being from my string. When I ran the string with the gene names, all of my desired genes were present and more, so I know they are there.. perhaps not labeled with TIGR? I am not sure
but have you used columns="PFAM"
for that?
I see that your search string is mixing also KEGG IDs, so maybe you could make use columns = c("KEGG ID", "PFAM")
. Also I don't think we have TIGRfams in the annotation, so you will need to find other search terms for those families.
but have you used
columns="PFAM"
for that?
When I used that specification only 1/3 of the PFAMs in my string were in the unique(subsetSQM$orfs$table$PFAM)
in addition to 3 distinct PFAMS that were not in my string. Without the columns' specifications, I get 9/17, with the other 8 being the TIGRfams.
You shouldn't be getting PFAMs not in your string if you are using columns = 'PFAMs'
.
Can you post here the exact command you are using and the list of PFAMs that this is returning?
My apologies, with `> subsetSQM <- subsetFun(sqm, fun = 'PF12942|PF02461|TIGR03080|TIGR00836|PF18582|K00372|K00367|TIGR01287|K00368|K15864|TIGR04244|TIGR04246|K15576|TIGR01580|TIGR01792|K01999|K00053|K11073|TIGR02712|TIGR01729|TIGR01096', columns="PFAM")
unique(subsetSQM$orfs$table$PFAM) [1] "PF02461 [Ammonia monooxygenase]"` ... Only the one out of 3
Can it be that that is the only PFAM from your list present in the data?
What if you run subsetFun
individually for each PFAM? (with the columns='PFAM'
bit)
Also I don't think we have TIGRfams in the annotation, so you will need to find other search terms for those families.
True. TIGRfams should indeed be present because we included them within the Pfam database, but.. I forgot to upload them in our database. My bad.
Hello, I tried to filter my original SQM object with
subsetSQM <- subsetFun(sqm, fun = hmms)
with the fun= being a vectorhmms <- gen_list$hmms
created from a text file with a list of hmms loaded withgen_list <- read.delim("gen_list.txt", header = TRUE, sep = "\t")
with two columns (gene and hmms) (ie. PF12942|PF02461|TIGR03080; TIGR00836; PF18582; K00372|K00367; etc.). The object size decreased from 13.06GB to 5.89GB, so I had thought it was successful; however, once I looked into thesubsetSQM$orfs$table
I realized that the subsetted object still contains almost 10k unique genes. Do you know what has been filtered out for the object for it to decrease in size? Also, can you please advise on how to best use thesubsetFun()
to filter for a list of IDs/genes (17 to be exact)?