qiime2 / q2-diversity

BSD 3-Clause "New" or "Revised" License
4 stars 45 forks source link

ENH: adonis: validate metadata to check for NaNs #265

Closed nbokulich closed 3 years ago

nbokulich commented 4 years ago

Improvement Description Generate a user-friendly error when NaNs are present in the metadata

Current Behavior R spews some nonsense about non-conformable arrays

Proposed Behavior Catch this with some simple metadata validation, or capture the error message and report something more user-friendly.

shravya-95 commented 4 years ago

Is the issue in run_adonis.R file?

nbokulich commented 4 years ago

Hi @shravya-95 the issue is with vegan (R), this is not a bug in the code (otherwise I would label this Issue as a bug, rather than an enhancement).

Vegan adonis throws an error if there are NaNs in the sample metadata, so this is just a matter of pre-validating the metadata before running the R subprocess, and if NaNs are present raise a more user-friendly error.

To word this another way: the error raised by R causes a lot of confusion, and non-conformable arrays can mean a few different things but is usually caused by NaNs in the metadata. So checking for NaNs in the passed metadata will help users troubleshoot this issue more easily.

Let me know if you are interested in working on this issue!

shravya-95 commented 4 years ago

Thanks for clarifying @nbokulich ! I meant to ask if the enhancement is meant to modify the data sample.md that is being passed into adonis in run_adonis.R. Is this the dataframe that needs to be checked for NaNs?

Yes I am interested in working on this issue!

nbokulich commented 4 years ago

Thanks for clarifying @shravya-95 !

Yes, the dataframe passed to adonis is what should be checked for NaNs... you could probably check for missing values around this line to multi-task that validation.

However, I don't think we want to modify that dataframe (at least not by default), since it would yield misleading results, e.g., imagine if all but a few samples had NaNs for the selected metadata column(s), then results would be based on an exceedingly small number of samples... I suppose we could, e.g., add an option to drop samples with NaNs (and default to "error", i.e., if NaNs are present raise an error but don't attempt to automatically fix), though that becomes a more complicated fix so not sure if you want to go that route.

Thanks!

shravya-95 commented 4 years ago

Sure that makes perfect sense @nbokulich ! Thanks again for the clarification.