Using more stringent parameters to avoid spurious sequences

apcamargo commented 5 years ago

Hey there,

I've been testing some PLASS assemblies with my datasets and I noticed that it retrieves ~50x times more proteins than my MEGAHIT+Prodigal predictions, which is way more than what is shown in Fig S2 of the paper.

I ran PLASS with default parameters, which means that a lot of very short peptides were retrieved (as the default value of --min-length is 45). I also noticed that most of them are incomplete (without * at the beginning and the end of the sequence). Even though partial peptides are to be expected in metagenome assemblies, I'm concerned that PLASS may be giving me a substantial amount of spurious sequences. In Swiss-Prot, ~8% of the prokaryotic proteins are between 45 and 100 residues. In my assemblies, 40% of the sequences fall into this interval.

In the paper, you excluded peptides shorter than 100 residues (which is something that I'd expect to attain by using --min-length 100), but I'm apprehensive of doing that because I don't want to left short bona-fide proteins out of my assembly.

Do you think raising the neural network threshold is a sound idea to solve the problem of spurious sequences? How was the default threshold (0.2) determined?

martin-steinegger commented 5 years ago

@apcamargo thank you trying Plass. Yes it is hard to reliably call short genes and distinguish the frames correctly. We set the threshold of our NN to not harm the sensitivity too much. It would for sure help to change the threshold to 0.5 or even higher.

Another idea for short genes would be, instead of doing a six-frame translation use some gene caller on the read and only assemble this fragments using Plass. This way a lot of the wrong frame assemblies should disappear.

Alternative approach to call short genes: If you have 150 paired end reads and they overlap than you can already use a gene caller to predict the genes from merged nucleotide sequences, for genes up to 80 length. They should be more reliable compared to the neural network from Plass since they use DNA and protein information.

apcamargo commented 5 years ago

Thank you, @martin-steinegger

I have 2×150bp reads that do not overlap, so I'm afraid that I won't be able to call genes in such short sequences. Thank you for the suggestion!

I'll try to raise the neural network threshold and see how much of a difference it does in my dataset. In case I still get a large fraction of short peptides, I'll probably disable partial sequences by allowing only peptides with a start and a stop codon.

I know that by filtering out partial peptides I'll filter out many real protein sequences but most of those partial peptides can't be functionally anotated anyway. Realistically, I don't think that I'll lose much information (especially considering that I'll aggregate my PLASS and my MEGAHIT+Prodigal proteins into a non-redundant dataset).

mooreryan commented 5 years ago

@apcamargo Did you have good results from adjusting the NN option with your shorter peptides? It would be helpful to know!

apcamargo commented 5 years ago

Hi @mooreryan

I classified my sequences into three groups:

Complete (with start and stop codons)
Semi-partial (with a start or a stop codon)
Partial (without start and stop codons)

Here's the length histogram for each of these groups: trimmedstrict

I've noticed that there's a bimodal distribution with a short peptides peak (<60aa) and a long peptides peak (>60aa). So, to test whether raising the NN threshold would improve the reliability of the assembled peptides, I BLASTed subsets of 500 short peptides of each group. In the table below, I put the number of sequences that had at least one hit in the nr database.

	Threshold=0.5	Threshold=0.2
Complete	175/500	173/500
Semi-partial	283/500	274/500
Partial	369/500	359/500

As you can see, the proportion of sequences that had at least one hit is not very different. As the assembly with stricter parameters had a much lower number of peptides (27065097 vs. 32156588), I decided to keep the default parameters.

mooreryan commented 5 years ago

@apcamargo Thanks, that is very helpful!!

soeding commented 5 years ago

@apcamargo Thanks for the very interesting analysis!

I was wondering what the left distributions below 60 amino acids length are. They look a bit like what you would expect of false calls with exponentially decaying length distribution (due to the random occurrence of stop codons). Could you plot them on a linear scale between 0 and 100 amino acids to get a better feeling for this?

Another possibility is that at least some of them are real but very short, expressed, and active peptides like the ones recently described by Amy Bhat's group: https://pdfs.semanticscholar.org/d608/782a771fb1127da20a2ae3a3473560bf34c1.pdf My hunch is that the complete ones with l<=60 are enriched for such functional peptides.

Could you plot the fractions of BLAST matches for l<=60 aa and l > 60 aa for your three groups (complet, semi, partial)? (It might turn out ambiguous because a large fraction of them will have no homologs in the database and homologies are more difficult to find for such short proteins.)

Perhaps some clustering and conservation analysis like the Bhatt lab performed could turn out interesting. They clustered their short putative proteins predicted in 1700 human-associated metagenomes and ran RNAcode (Washietl et al., 2011) on the MSAs, which classifies them as coding or noncoding based on amino acid versus nucleotide-level conservation.

This might be more appropriate to continue by email. Feel free to reply to soeding at mpibpc.mpg.de

martin-steinegger commented 3 years ago

@apcamargo Thank you again for your insightful analysis. We added a remark in the README of Plass. See: https://github.com/soedinglab/plass/blob/master/README.md

apcamargo commented 3 years ago

Thanks @martin-steinegger! After our conversation. Got some ideas to improve the NN for small peptides. If you plan to work on that in the future, let me know!

martin-steinegger commented 3 years ago

@apcamargo great to hear! Currently I do not work on it. But I am happy to help integrating the networks if needed.

soedinglab / plass

Using more stringent parameters to avoid spurious sequences #17