soedinglab / metaeuk

MetaEuk - sensitive, high-throughput gene discovery and annotation for large-scale eukaryotic metagenomics
GNU General Public License v3.0

Empty output files. #53

Closed DanieleDeco closed 1 year ago

DanieleDeco commented 1 year ago

I am running MetaEuk on assembled (metaSPAdes) contigs from samples collected in the Antarctic Ocean.

Unfortunately, my output files are all empty. It seems that my sequences did not pass the prefiltering.

Attached you can find the log file.

Thanks

Daniele

Attachment: nohup.txt

elileka commented 1 year ago

Hi Daniele,

Would it be possible for you to send us one of your contigs so that we can have a closer look?

Theoretically, it is possible that none of the contigs contains genes that are similar to UniProt90, but it is not the most likely scenario. Can you say anything about the organisms you expect to have contigs of? How many of your 5M contigs are longer than, say, 5000 bp?

Eli

DanieleDeco commented 1 year ago

Dear Eli,

Please find attached a few contigs from my input file. Our samples are metagenomes from the surface waters of the Antarctic Ocean, so we would expect to have a few eukaryotic phytoplankton sequences. 8000 sequences were longer than 5000 bp.

Cheers

Daniele

Attachment: small_contigs.txt

elileka commented 1 year ago

Hey Daniele,

Thank you for sending these contigs.

Generally, I would say that contigs shorter than 1000 bp are not very likely to contain a full protein. I would consider only those longer than 2000-5000 bp as "good quality". This means that out of your millions of contigs, many are very short, perhaps too short.
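To make the length cutoff above concrete, here is a minimal Python sketch that counts and keeps only contigs of at least a chosen length. It is not part of MetaEuk; the function names and the 2000 bp default are illustrative assumptions based on the range mentioned above.

```python
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open FASTA file."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(in_handle: TextIO, out_handle: TextIO,
                     min_len: int = 2000) -> Tuple[int, int]:
    """Write contigs of at least min_len bp to out_handle.

    Returns (kept, total) so you can see how much of the assembly survives.
    The 2000 bp default is an assumption taken from the range in this thread.
    """
    kept = total = 0
    for header, seq in read_fasta(in_handle):
        total += 1
        if len(seq) >= min_len:
            kept += 1
            out_handle.write(f">{header}\n{seq}\n")
    return kept, total

# Example usage (file names are hypothetical):
# with open("contigs.fasta") as src, open("contigs_min2000.fasta", "w") as dst:
#     kept, total = filter_by_length(src, dst, min_len=2000)
#     print(f"kept {kept} of {total} contigs")
```

Running MetaEuk on the filtered file instead of the full assembly should also shorten the search considerably when most contigs are very short.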

I ran easy-predict on the contigs you provided against the MetaEuk results from our paper (I mean these 6M unique predictions). I chose this database because it is relatively small and I work locally. It ran fast and produced 24 protein predictions from your contigs. This may suggest that UniProt90 is less suitable for you.

I therefore suggest the following steps:

1) Download a contig of a well-studied organism, for example Saccharomyces cerevisiae.
2) Run easy-predict on this contig against your copy of UniProt90. If you get predicted proteins, then we can rule out a severe technical problem with running easy-predict. If you DON'T get results, please contact us again.
3) Assuming you get results for this test, it strengthens the hypothesis that you have less-studied or unstudied organisms in your dataset. I therefore suggest you try using our marine profile database as the target instead of UniProt90. This is a large database, which can be found here. Please note, though, that currently it can only be used with MetaEuk V5 (we're working on updating it) (updated Dec 2022).

If you wish to run against a much less comprehensive (but relevant for marine) database, then use the one I mentioned above.
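The control run in step 2 could be scripted roughly as follows. This is a sketch, assuming that `metaeuk` is on your PATH, that easy-predict takes its arguments in the order contigs, targets, output prefix, temporary folder, and that predicted proteins are written to `<prefix>.fas`. All file names are hypothetical placeholders.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def easy_predict_cmd(contigs: str, targets: str,
                     out_prefix: str, tmp_dir: str) -> List[str]:
    """Build the easy-predict command line (assumed argument order)."""
    return ["metaeuk", "easy-predict", contigs, targets, out_prefix, tmp_dir]


def run_control(contigs: str, targets: str,
                out_prefix: str = "control_preds",
                tmp_dir: str = "tmp_control") -> bool:
    """Run easy-predict and report whether any proteins were predicted.

    Returns True if <out_prefix>.fas exists and is non-empty
    (the output file name is an assumption).
    """
    if shutil.which("metaeuk") is None:
        raise RuntimeError("metaeuk executable not found on PATH")
    subprocess.run(easy_predict_cmd(contigs, targets, out_prefix, tmp_dir),
                   check=True)
    fas = Path(out_prefix + ".fas")
    return fas.exists() and fas.stat().st_size > 0

# Example usage (hypothetical file names):
# ok = run_control("yeast_contig.fasta", "UniRef90_db")
# print("control produced predictions" if ok else "control output is empty")
```

If the yeast control produces predictions but your own contigs do not, the reference database is the more likely culprit than the installation.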

If any of these steps doesn't make sense, please feel free to write again.

Best, Eli

DanieleDeco commented 1 year ago

Dear Eli,

I ran easy-predict on my contigs using the Tara Ocean database (provided by your paper) and got plenty of matches.

Thanks for your help.

Cheers

Daniele

elileka commented 1 year ago

Dear Daniele,

First of all, I am happy this worked.

I am also curious, did you use the profile database we released or the (much slimmer) protein database of predictions? If the slimmer version, then, depending on your needs, it might be worth it to run against the profile database. I give links to both in my previous comment.

It is also interesting that there was no match to UniProt90. Obviously, there are organisms that are quite far from the well-studied ones, but I am still quite surprised. Well, if you have interesting insights you'd like to share, you know where to find us :)

Best, Eli

DanieleDeco commented 1 year ago

Dear Eli,

I only tried with MetaEuk_preds_Tara_vs_euk_profiles_uniqs.fas. I am also surprised that I did not get any match with UniProt90; I might try to run it again.

Cheers

Daniele

elileka commented 1 year ago

That would be interesting. To save time and effort, you could run only a single contig: your longest one, or the one richest in predictions based on the 6M database. If it returns nothing, I would also run the Saccharomyces control.
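One way to find the contig richest in predictions is to count predictions per contig in the output FASTA from the earlier run. The sketch below assumes the MetaEuk header layout in which fields are separated by `|` and the second field is the contig accession. Please check this against your own headers, since accessions that themselves contain `|` would break it.

```python
from collections import Counter
from typing import Iterable


def predictions_per_contig(lines: Iterable[str]) -> Counter:
    """Count predictions per contig from MetaEuk FASTA header lines.

    Assumes headers look like '>target|contig|strand|...' so that the
    contig accession is the second '|'-separated field.
    """
    counts: Counter = Counter()
    for line in lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        fields = line[1:].strip().split("|")
        if len(fields) >= 2:
            counts[fields[1]] += 1
    return counts

# Example usage (hypothetical file name):
# with open("preds_vs_6M.fas") as fh:
#     contig, n = predictions_per_contig(fh).most_common(1)[0]
#     print(f"{contig} has {n} predictions")
```

The top contig from this count can then be extracted and run alone against UniProt90.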