shenwei356 / taxonkit

A Practical and Efficient NCBI Taxonomy Toolkit, also supports creating NCBI-style taxdump files for custom taxonomies like GTDB/ICTV
https://bioinf.shenwei.me/taxonkit
MIT License

Include LCA queries in output even when all taxids are skipped #89

Closed standage closed 7 months ago

standage commented 7 months ago

Describe your issue

Hi, thank you for maintaining this valuable tool!

I have a feature request for the taxonkit lca operation.

Consider the following example.

$ echo -e '743375\n2975295\n987654321\n743375 2975295\n743375 987654321\n2975295 987654321' | taxonkit lca 
15:51:39.871 [WARN] taxid 2975295 was deleted
15:51:39.871 [WARN] taxid 987654321 not found
15:51:39.871 [WARN] taxid 2975295 was deleted
15:51:39.871 [WARN] taxid 987654321 not found
15:51:39.871 [WARN] taxid 2975295 was deleted
743375  743375
2975295 0
987654321       0
743375 2975295  0
743375 987654321        0
2975295 987654321       0

All of the queries return 0 except for the first, since they contain at least one deleted or invalid taxid. The --skip-deleted and --skip-unfound flags allow a "rescue" of the query, as long as there is at least 1 valid taxid. However, if there are no taxids left after deleted and unfound taxids are skipped, the query is excluded completely from the results output.

$ echo -e '743375\n2975295\n987654321\n743375 2975295\n743375 987654321\n2975295 987654321' | taxonkit lca --skip-deleted
15:51:44.723 [WARN] taxid 2975295 was deleted
15:51:44.723 [WARN] taxid 987654321 not found
15:51:44.723 [WARN] taxid 2975295 was deleted
15:51:44.723 [WARN] taxid 987654321 not found
15:51:44.723 [WARN] taxid 2975295 was deleted
15:51:44.723 [WARN] taxid 987654321 not found
743375  743375
987654321       0
743375 2975295  743375
743375 987654321        0
2975295 987654321       0
$ echo -e '743375\n2975295\n987654321\n743375 2975295\n743375 987654321\n2975295 987654321' | taxonkit lca --skip-unfound
15:51:47.945 [WARN] taxid 2975295 was deleted
15:51:47.945 [WARN] taxid 987654321 not found
15:51:47.945 [WARN] taxid 2975295 was deleted
15:51:47.945 [WARN] taxid 987654321 not found
15:51:47.945 [WARN] taxid 2975295 was deleted
743375  743375
2975295 0
743375 2975295  0
743375 987654321        743375
2975295 987654321       0
$ echo -e '743375\n2975295\n987654321\n743375 2975295\n743375 987654321\n2975295 987654321' | taxonkit lca --skip-deleted --skip-unfound
15:51:52.278 [WARN] taxid 2975295 was deleted
15:51:52.278 [WARN] taxid 987654321 not found
15:51:52.278 [WARN] taxid 2975295 was deleted
15:51:52.278 [WARN] taxid 987654321 not found
15:51:52.278 [WARN] taxid 2975295 was deleted
15:51:52.278 [WARN] taxid 987654321 not found
743375  743375
743375 2975295  743375
743375 987654321        743375

It would be helpful if there were a way to include all queries in the output, even those with no valid taxids, when the --skip-* flags are used. It would look something like this.

$ echo -e '743375\n2975295\n987654321\n743375 2975295\n743375 987654321\n2975295 987654321' | taxonkit lca --some-magical-flags
15:51:39.871 [WARN] taxid 2975295 was deleted
15:51:39.871 [WARN] taxid 987654321 not found
15:51:39.871 [WARN] taxid 2975295 was deleted
15:51:39.871 [WARN] taxid 987654321 not found
15:51:39.871 [WARN] taxid 2975295 was deleted
743375  743375
2975295 0
987654321       0
743375 2975295  743375
743375 987654321        743375
2975295 987654321       0

shenwei356 commented 7 months ago

Added a new flag, -K/--keep-invalid: print the query even if no valid taxid is left.
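
For a taxonkit build that predates -K/--keep-invalid, roughly the same result can be approximated by post-processing the lca output. The sketch below is only an illustration, not part of taxonkit: the function name fill_missing_lca and the file names are hypothetical, and it assumes the lca output is tab-separated with the query echoed verbatim in the first column. Any query missing from the output is re-added with an LCA of 0, in the original query order.

```shell
# Hypothetical helper (not part of taxonkit): re-add queries that were dropped
# from the lca output, reporting 0 as their LCA.
#   $1: file with the original queries, one query per line
#   $2: lca output, assumed to be "query<TAB>lca_taxid" per line
fill_missing_lca() {
    awk -F '\t' '
        # First file operand (the lca output): remember the LCA for each query.
        FILENAME == ARGV[1] { lca[$1] = $2; next }
        # Second file operand (the queries): print every query in input order,
        # falling back to 0 when the query is absent from the lca output.
        { print $0 "\t" (($0 in lca) ? lca[$0] : 0) }
    ' "$2" "$1"
}
```

Usage would be along the lines of saving the queries to queries.txt, running taxonkit lca --skip-deleted --skip-unfound on that file with the output redirected to lca.tsv, and then calling fill_missing_lca queries.txt lca.tsv.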

standage commented 7 months ago

Excellent! I'll test and report back. Thanks for your rapid response!

standage commented 7 months ago

Thank you, this works just as expected!