BUG: sample IDs that look like scientific notation are treated as numbers

colinbrislawn commented 7 months ago

Trying to close #130

More testing is needed, especially a regression test for sampleIDs like this.

lizgehret commented 7 months ago

Thanks for working on this @colinbrislawn! Lmk when it's ready for a review and I'll take a look 🙂

colinbrislawn commented 7 months ago

This seems to work:

t(read.delim(inp_abundances_path,
  check.names = TRUE, # change to true to so the empty first column becomes .$X
  row.names = 1,
  colClasses = c(X = "character") # Make the .$X column a character, leave other columns
))

Any idea why check.names = F originally?

For the regression test, maybe import this file and make sure the sample names match? table-SI-names.tsv

Could I just run test_ids_in_table_not_in_md() again, with a new set of files?

lizgehret commented 7 months ago

This seems to work:

t(read.delim(inp_abundances_path,
  check.names = TRUE, # change to true to so the empty first column becomes .$X
  row.names = 1,
  colClasses = c(X = "character") # Make the .$X column a character, leave other columns
))

Any idea why check.names = F originally?

There wasn't a strong reason for that being set to False in the initial implementation - I hadn't run into this issue while developing, so I didn't realize this would be an edge case we'd need to consider. This looks reasonable to me though!

For the regression test, maybe import this file and make sure the sample names match? table-SI-names.tsv

Could I just run test_ids_in_table_not_in_md() again, with a new set of files?

Yeah that should work! I think as long as we continue to use the original files/tests, and then just create additional tests with your new set of files to make sure the behavior is the same, that should work well.

colinbrislawn commented 7 months ago

Hey Liz, I think I'm ready for your help here.

I've got a new test_ids_in_table_with_es testing function that should fail because I've reverted the ancom import code... but it works fine. I'm also unfamiliar with the dataloaf format

EDIT:

and then just create additional tests with your new set of files to make sure the behavior is the same, that should work well

Yes! I've added those files. Can you help me write that?

lizgehret commented 7 months ago

Yes, happy to help! Hung up with a few other things this afternoon, but I'll take a closer look tomorrow 🙂

lizgehret commented 7 months ago

Okay just adding some notes here for myself as I'm investigating:

Forum user who reported this issue was using QIIME 2 2023.9 and R 4.2.2.
I replicated the error with the same package versions using their dataset.
I created a dummy table/md file inline in a test to replicate the behavior using the same sample IDs that the forum user did to make sure the error would catch (which it did).
I added two tests (that both fail) for uppercase E and lowercase E (to ensure that R doesn't care about casing).

Adding just these two tests (not finalized) in a commit so you can take a look @colinbrislawn!

EDIT: Let me know if you want me to finish this up, or if you'd rather take it across the finish line - I'm happy either way, just don't want to steal your thunder!

lizgehret commented 7 months ago

Another thing I decided to do @colinbrislawn was to just create the table/MD files inline instead of using your newly created files - sorry for flip flopping on that, I realized that would be easier for me to see/iterate on the table as I was testing!

colinbrislawn commented 7 months ago

Okay, I'll take a look. I'm excited to see how you set up testing.

Let me know if you want me to finish this up, or if you'd rather take it across the finish line

I would like more experience with Python, so I'll see if I can get this finished. This may be more work for you, but if you are willing to help me out I would very much appreciate it.

lizgehret commented 7 months ago

I would like more experience with Python, so I'll see if I can get this finished. This may be more work for you, but if you are willing to help me out I would very much appreciate it.

Happy to help, and in full support of you learning! You just let me know what questions you have 🙂

colinbrislawn commented 7 months ago

I've run into issues around metadata, so I've updated both import functions. (Good thing we tested!)

In the metadata importe, I reference the sampleID column with the key sample-id. Is this first column always called that in this plugin or should I support other names?

Zooming out a little, Tidyverse packages like readr may give us better ways to handle this. We could start the tidyverse retooling, or save it for the ancom-bc2 function!

lizgehret commented 7 months ago

Hey @colinbrislawn, sample-id is only one of the supported names for metadata identifiers (that's just the one I happened to use in the original test data); the full list can be found here. I'm a bit wary of hard-coding all of these into the import function unless absolutely necessary - but definitely open to any improvements that might be available within tidyverse, since that's already a dependency!

Re: ancom-bc2 - we were waiting for that paper to be published before starting development on this method, but it looks like that's finally happened as of a little over a month ago! I'll chat with the rest of the eng team in this week's meeting about this and see what thoughts are there.

colinbrislawn commented 7 months ago

I thought I remembered a bunch of allowed sample-ids!

Here's the options I can see:

Use the python code to make sure that column is called sample-id before passing it to the R code I have in this PR
Switch to the tidyverse readr::read_delim() and use their better API.

What do you suggest?

lizgehret commented 7 months ago

I thought I remembered a bunch of allowed sample-ids!

Here's the options I can see:
* Use the python code to make sure that column is called `sample-id` before passing it to the R code I have in this PR

* Switch to the tidyverse `readr::read_delim()` and use their better API.
What do you suggest?

Let's see what can be done with readr::read_delim() - I was reviewing the available parameters for this method in the R docs, and I'm wondering if a good solution would be to utilize the col_types param and set that to character for the first column (i.e. the index column containing the sample IDs). I haven't investigated this in detail, so not sure if this will actually work like we want it to, but it seems like a good option to explore!

lizgehret commented 6 months ago

Hey @colinbrislawn - just as a head's up, I'm going to have to bump this one to the next release cycle (as our current planned release is technically today - but realistically it's looking like it'll be tomorrow/Friday). We can definitely get this in as a patch to the current release epoch though, so these changes wouldn't have to wait an entire cycle before being added into the ecosystem.

Let me know if you need anything from me or have any questions!

colinbrislawn commented 6 months ago

Let's push it to the next release or try a patch.

Tidyverse should be easy to implement, but I'm away from my main dev PC right now so I'm a little delayed...

lizgehret commented 6 months ago

Sounds good @colinbrislawn, just let me know if you need anything from me or when you're ready for another review 🙂

gregcaporaso commented 6 months ago

@colinbrislawn, feel free to ping me for a review too (@lizgehret is out for the next few weeks).

colinbrislawn commented 6 months ago

Thanks Greg. Metadata has proven as hard as expected, but I've got feature data parsing up and running.

The inp_abundances_path now gets parsed into an otu_file with readr::read_tsv(), and the first column is always treated as a string. I think this is good to go!

I tried to import inp_metadata_path in the same way, but downstream steps need a base R data.frame with rownames, and Tidyverse does not believe in those. So I'm sticking with base-R for now.

I've got one more bug to fix before we merge...

colinbrislawn commented 5 months ago

Checks pass on my machine!

I've got to remove my own notes and code before we merge...

colinbrislawn commented 5 months ago

@lizgehret, I've got the R code working, including a patch to _ancombc.py to always output sample-id

When you have a chance, could you take a look at the unit test?

Thanks!

lizgehret commented 4 months ago

Hey @colinbrislawn!

Sorry for the delay, still getting caught up on things after vacation and travel. Will give this a proper review shortly!

colinbrislawn commented 4 months ago

I was initially thinking of IDE in terms of interactive development environment

Good idea. I noticed that too when I wrote it. This I can fix.

add assertions into the contents of the dataloaf itself for these new test cases.

I'm unfamiliar with the dataloaf format. Would you like me to try this first or is this something you could wrap up for me?

lizgehret commented 4 months ago

I'm unfamiliar with the dataloaf format. Would you like me to try this first or is this something you could wrap up for me?

I can finish that up, thanks @colinbrislawn! I'll wait for your name change updates re: IDE and then take it from there. 🙂

lizgehret commented 4 months ago

okay @colinbrislawn i've updated these tests to be more comprehensive - they now assert that the contents of the resultant IDs in each slice of the dataloaf are what we'd expect.

i also apologize for asking you to update the names of the files you added that contained IDE - i realized after the fact that we didn't need those files after all 🤦‍♀️

i'm going to get someone else to give this a final review (@colinvwood TIA!) since i'm no longer an unbiased reviewer 😉 but this should be g2g!

qiime2 / q2-composition

BUG: sample IDs that look like scientific notation are treated as numbers #131