metagenome-atlas / atlas

ATLAS - Three commands to start analyzing your metagenome data
https://metagenome-atlas.github.io/
BSD 3-Clause "New" or "Revised" License
368 stars 97 forks source link

skani parameter usage in atlas derep #672

Closed bluenote-1577 closed 1 year ago

bluenote-1577 commented 1 year ago

Hi there,

I am the author of skani, and I recently discovered that atlas has switched to using skani for dereplication. I'm thrilled and I think skani is a great choice!

I noticed that in the the derep.smk file (https://github.com/metagenome-atlas/atlas/blob/891c60c149ace5a05f01d5da6bf6892967a152fc/workflow/rules/derep.smk) that the --robust parameter is specified.

To be honest, I haven't used this parameter in a while, and I caution against using it. It sometimes gives a better clustering, than with default parameters, but it actually biases the ANI upwards slightly. For example, often I see something like 95% ANI -> 96% ANI with --robust specified. So if you choose a dereplication threshold of 97.5, It may be something more like 96.5 instead.

I won't go into technical details, but skani is v0.1.4 right now and in skani v0.1.5 I plan on making this problem a bit better, but it will still have a bit of bias (I will update skani's readme to make this more clear in the future).

Overall, I would recommend not using the --robust parameter, unless you found it to be better, in which case I would be interested to hear. Maybe you can try it once I release v0.1.5.

Thanks,

Jim

P.S. Another option --trace is also used. I only have used it to debug very specific little things and I think it may not be useful for your logs :).

SilasK commented 1 year ago

Thank you for reaching out. I will make sure to implement your suggestion.

Do you have some recommendations about the alignment fraction? I think FastANI takes the average alignment fraction.

My intuition was to use the max alignment fraction as a small genome fragment aligned to a large genome with high ANI would indicate that the two should be clustered together. Anyway, the larger genome would be chosen as the cluster representative.

Maybe even better would be to remove the smaller genome if the difference between the two genomes is too large...

bluenote-1577 commented 1 year ago

FastANI appears to output only the alignment fraction for the query when doing something like fastANI -q gn1.fa -r gn2.fa and doesn't do any averaging (see output format in https://github.com/ParBLiSS/FastANI#an-example-run).

I remember reading about this alignment fraction problem in dRep: https://drep.readthedocs.io/en/latest/choosing_parameters.html#minimum-alignment-coverage where they just set the ANI to 0 if the aligned fraction is < 10% (by default, I think). But because there are 2 aligned fractions, I'm not sure which one they use.

I'm not sure your exact use case, but the maximum aligned fraction may become an issue if it's a mobile element, where this mobile element can have high ANI and high alignment fraction to two distant genomes, but these two distant genomes should not be clustered together. Something to think about...