mhoban / rainbow_bridge

GNU General Public License v3.0
5 stars 2 forks source link

uncultured, seqids in lca/collapse_taxonomy (many fewer zOTUs/taxa kept in lca results in more recent runs) #79

Closed cajwalsh closed 2 months ago

cajwalsh commented 2 months ago

I have been running some older individual datasets together (starting with the merged fastas) over the past few months and noticed that I have been getting fewer zOTUs/taxa kept in lca results in more recent runs. I first thought it may have been the difference between lulu and non lulu that I hadn't noticed, but then I also noticed what I mentioned in #72 . More recently I noticed that it is in the lca step.

The differences between my original run on August 2023 and repeated for an individual dataset (exact same starting input files) July 2024:

Keeping only eukaryotes brings the old down to 1452 and the new down to 789.

One major difference might therefore be the uncultured filter added/fixed. From old and new including all domains, there were 276 old "species (column) values" not found in new: 178 are matched using grep("uncultured|environmental|sample|clone|synthetic"). However, there are also still 59 hits for the same pattern in the new species (so idk if the uncultured filter is on by default or actually working).

52 eukaryote "species column values" (35 unique) are found in old but not new. Of the ones I checked, some have congeners that appear in both old and new. The first few I looked at that appeared in old but not new, several had "PREDICTED: " hits to other taxa in the new blast results, but the main species found in old but "missing" in new were not "PREDICTED: ".

The other (possibly main) thing I noticed was that all the "missing" ones only had one seqid everywhere they appeared in the new blast results, whereas other taxa in the same genus as missing ones had multiple seqids throughout the blast results of many zotus. I think every species I found in old but not new appeared in the blast results of the first zotu they matched either below a 97% match or within 1% of others and therefore collapsed. Is it possible that line 107 of bin/collapse_taxonomy.R (or anything doing something similar since the .keep_all thing may be doing something) is keeping only 1 line of that seqid across all blast results instead of just all blast results per individual zotu? It looks like this line was last changed around the same time as I noticed insect giving different results in #72. This is as far as I got so far but let me know if I should look into more.

If you want to have a look at the differences, I have attached some files. Both sets of blast results, 97% pid lca, and the list of species level assignments found in old but not new. check_me.zip

cajwalsh commented 2 months ago

I just did some testing and I think the distinct call on line 107 is the problem. When I run the new blast results through those first 3 steps (loading, sorting, filtering using distinct(seqid, .keep_all = TRUE)) I go from the 28967 to only 4703. When I change the distinct call to also include zotu as distinct(seqid, zotu, .keep_all = TRUE) I only go down to 16560.

mhoban commented 2 months ago

This makes sense, and gives me some ideas about other potential problems I've been hearing about.

mhoban commented 2 months ago

This should be addressed (along with another problem I spotted) in 2e287af

I tested the old (python) version against this new version and the only differences between the two were uncultured things that are now being filtered out (and can be retained if desired).

cajwalsh commented 2 months ago

Thanks! FYI I reran the one I did over the past few days with --collapse-taxonomy added. It appears to have worked but I think finalize freaked out:

Command executed:

finalize.R --filter "" --remap "" --insect "insect_taxonomy.tsv" --lca "lca_taxonomy.tsv" --controls "" --control-action "remove" --control-threshold "0.1" --decontam-method "auto" --concentration "" --abundance-threshold "0.0001" --rarefaction-method "perm" --permutations "100" --taxon-priority "lca" --curated "NOTADANGFILE.nothing.lulu" zotu_table.tsv

Command exit status: 1

Command output: (empty)

Command error: Error in mutate(): ℹ In argument: taxid = coalesce(taxid_lca, taxid_insect). Caused by error: ! object 'taxid_lca' not found Backtrace: ▆

  1. ├─taxonomy %>% ...
  2. ├─dplyr::mutate(...)
  3. ├─dplyr:::mutate.data.frame(...)
  4. │ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
  5. │ ├─base::withCallingHandlers(...)
  6. │ └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
  7. │ └─mask$eval_all_mutate(quo)
  8. │ └─dplyr (local) eval()
  9. ├─dplyr::coalesce(taxid_lca, taxid_insect)
    1. │ └─rlang::list2(...)
    2. └─base::.handleSimpleError(...)
    3. └─dplyr (local) h(simpleError(msg, call))
    4. └─rlang::abort(message, class = error_class, parent = parent, call = error_call) Execution halted

Work dir: /drives/storage/users/cajw/eDNA/PS/joint_DMA_Niue_COI/rainbow_bridge/work/2a/5d3bf44279b8727df796254928503e

Tip: when you have fixed the problem you can continue the execution adding the option -resume to the run command line

-- Check '.nextflow.log' file for details

cajwalsh commented 2 months ago

Hmm I also noticed that taxid is no longer included in lca output. If that is a desired and possibly permanent change lmk and I will change some of my functions to no longer look for that.

mhoban commented 2 months ago

I removed it because it wasn't being assigned correctly (see #80 & #83). In the meantime I'll update it to include the taxid of species-level IDs. Anything else will be NA.

cajwalsh commented 2 months ago

Ah makes sense. I definitely don't need it I just specified how to merge it in one of my functions. Based on those issues I think fine to leave it out and no need to change it just for me. I will just change my function. Thanks!

mhoban commented 2 months ago

Ok this should actually work now.