qiime2 / q2-composition

BSD 3-Clause "New" or "Revised" License

BUG: ANCOM-BC Fails if sample IDs look too much like exponent #130

Closed: cherman2 closed this issue 4 months ago

cherman2 commented 7 months ago

Bug Description

A user on the forum reported getting the error message "duplicate 'row.names' are not allowed" from ANCOM-BC. For this user, the cause was sample IDs that look like exponents. My theory is that the sample IDs lose some resolution when they are converted to numbers in exponent notation, so they become "duplicates". I was able to fix this by replacing E with F in the sample IDs, after which it worked fine.
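To illustrate the theory above, here is a minimal Python sketch of how distinct ID strings can collapse to the same value once they are parsed as numbers. It uses Python's `float()` as a stand-in for the numeric conversion that happens in R, which is an assumption (the two parsers are not identical), and the IDs are hypothetical rather than taken from the forum user's table. `find_numeric_collisions` is a made-up helper name.

```python
from collections import defaultdict


def find_numeric_collisions(sample_ids):
    """Group IDs that parse to the same float; return only groups of 2+."""
    groups = defaultdict(list)
    for sample_id in sample_ids:
        try:
            value = float(sample_id)
        except ValueError:
            continue  # not number-like, so it would stay a string
        groups[value].append(sample_id)
    return {value: ids for value, ids in groups.items() if len(ids) > 1}


# Hypothetical IDs, not taken from the forum user's table.
ids = ["0001E002", "0010E001", "0002E002", "L0001E002"]
collisions = find_numeric_collisions(ids)
print(collisions)  # {100.0: ['0001E002', '0010E001']}
```

Both `0001E002` and `0010E001` parse to 100, so they would end up as the same row name even though the original strings differ.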

Forum x-ref

Steps to reproduce the behavior

  1. Run ANCOM-BC with the forum user's data

Steps to reproduce the solution

In Python:

  1. from qiime2 import Artifact
  2. import pandas as pd
  3. table = Artifact.load("/Users/chloeherman/Downloads/forum-table.qza")
  4. df = table.view(pd.DataFrame)
  5. df.to_csv("/Users/chloeherman/Downloads/forum-table-3.txt", sep="\t", header=True)
  6. Edit the sample IDs in the text file so that they do not contain "E".
  7. Go to R.
  8. Run read.delim(file, check.names = FALSE, row.names = 1). This will fail if you haven't done step 6.
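The manual edit in step 6 could also be scripted. The sketch below is one possible way to do it, not the fix that was eventually merged: `replace_exponent_marker` is a hypothetical helper, the IDs are made up, and it only handles an uppercase "E" (a lowercase "e" would need the same treatment).

```python
def replace_exponent_marker(sample_ids, old="E", new="F"):
    """Return {original_id: renamed_id}, refusing renames that collide."""
    mapping = {sample_id: sample_id.replace(old, new) for sample_id in sample_ids}
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("renaming would create duplicate sample IDs")
    return mapping


# Hypothetical IDs for illustration only.
mapping = replace_exponent_marker(["0001E002", "0010E001"])
print(mapping)  # {'0001E002': '0001F002', '0010E001': '0010F001'}
```

With the DataFrame from step 4, something like `df = df.rename(index=mapping)` before step 5 should apply the renamed IDs, assuming the sample IDs are the DataFrame index.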

colinbrislawn commented 7 months ago

I'm interested in helping with this one.

gregcaporaso commented 7 months ago

Thanks @colinbrislawn, that would be great. Do you have an idea of timeline? We do have a release coming up around mid-Feb, and it would be nice to get a fix in for this by then (so would probably need a PR by around the end of next week for that to happen).

Let us know if you need any input.

colinbrislawn commented 7 months ago

If you can provide some support, I bet I can get this submitted by next week.

gregcaporaso commented 7 months ago

Sounds good @colinbrislawn, what kind of support would you need? Someone on the team in my lab should be able to help out as needed.

colinbrislawn commented 7 months ago

what kind of support would you need?

idk, let's find out!

Chloe, can you send me forum-table.qza or forum-table-3.txt, or maybe just post the output of head forum-table-3.txt here?

I hope this can be fixed with careful use of read.delim, read_csv, or read_tsv to better support sample names.

colinbrislawn commented 7 months ago

I've confirmed what Chloe suspected, which is that specific sample names are being converted to numbers.

This only happens when all sample names look like scientific notation, e.g. 0001E002. If a single name breaks the format, like L0001E002, it's fine, which is why we haven't seen it sooner.

In my testing, simply adding colClasses = c("character") to read.delim() fixes it. I've opened a PR: #131
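The behavior described above can be sketched as a toy model in Python. This is not how R's `read.delim()` is implemented; it only mimics the all-or-nothing type inference reported in this thread. `read_row_names`, `has_duplicates`, and the `force_character` flag are hypothetical names, with `force_character` standing in for the effect of `colClasses = c("character")`, and the IDs are made up.

```python
def read_row_names(raw_ids, force_character=False):
    """Toy model of type inference for a row-name column.

    If every value parses as a number (and force_character is False), the
    whole column is converted; otherwise every value stays a string.
    """
    if not force_character:
        try:
            numbers = [float(raw_id) for raw_id in raw_ids]
        except ValueError:
            pass  # at least one ID is not number-like: keep strings
        else:
            return [format(number, "g") for number in numbers]
    return list(raw_ids)


def has_duplicates(names):
    """True if any name occurs more than once."""
    return len(set(names)) != len(names)


all_numeric = ["0001E002", "0010E001"]   # hypothetical: every ID looks numeric
one_breaks = ["L0001E002", "0010E001"]   # hypothetical: one ID breaks the pattern

print(has_duplicates(read_row_names(all_numeric)))                        # True
print(has_duplicates(read_row_names(one_breaks)))                         # False
print(has_duplicates(read_row_names(all_numeric, force_character=True)))  # False
```

In this model, duplicates appear only when every ID is number-like and the column is not forced to stay as text, which matches the two observations in this comment.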

More testing is needed!

cherman2 commented 7 months ago

Thanks for looking into this. It looks like you've made some progress, so you might not need it, but here is forum-table-2.txt. This is the file that I got to pass by changing the Es in the sample ID column to Fs.

colinbrislawn commented 4 months ago

Proposed change log:

q2-composition

@colinbrislawn and @lizgehret fixed a :bug: in ANCOM-BC where sample IDs that look like scientific notation (1e23) were being interpreted as numbers.

lizgehret commented 4 months ago

changelog text looks great, thanks @colinbrislawn! will get this added in the next couple of weeks 🙂