qiime2 / q2-dada2

QIIME 2 plugin wrapping DADA2
BSD 3-Clause "New" or "Revised" License
19 stars 36 forks source link

Ids mixed with barcodes #125

Closed Oddant1 closed 1 year ago

Oddant1 commented 4 years ago

This pull request closes #102 by basically removing the bar code number from the equation entirely, so we end up sorting on the id, and then on the R1/R2 guaranteeing the files are lined up correctly.

The test files are copies of test files from the sample_seqs_paired folder renamed to match the files referenced in #102. Unfortunately this introduces another issue. The new test fails with the following error message ValueError: Metadata IDs must be unique. The following IDs are duplicated: 'V130'

This comes from the fact that we split the filename on '_' then take the first element as the id when we create the metadata.

    df = df.rename(index=_filepath_to_sample)
    metadata = qiime2.Metadata(df)
def _filepath_to_sample(fp):
    return fp.rsplit('_', 4)[0]

This seems to disallow the use of underscores in sample ids. I don't really see this as a problem, but do we want to change it? Or should I change the ids being tested to say 130 and 130.2 (instead of 130_2) to continue replicating the issue referenced in #102 without raising this additional issue?

Oddant1 commented 4 years ago

Pinging this in case @thermokarst or @ebolyen want to make a decision on the question raised in the description. If not I'll probably go with the "change the id to 130.2" option and just maintain the implied "no _ in ids"

thermokarst commented 4 years ago

I need to look more closely at this - putting it on my todo list now. Thanks!

thermokarst commented 4 years ago

@Oddant1 - I don't think this will make the 2020.6 release - pushing this onto the 2020.8 release board. Thanks!

Oddant1 commented 4 years ago

Hmmm, I didn't make a MANIFEST for my test data. I suppose I should do that first.

What I just said above raises another issue though. @thermokarst are these artifacts required to have MANIFEST files? Because I was able to create a SingleLanePerSamplePairedEndFastqDirFmt for testing without a MANIFEST.

EDIT: Also if it's alright, I'm just going to change the ids to 130 and 130.2 do we don't have the whole no _ in sample ids issue mentioned in the original PR message. The forum post used _ but for the purposes of this bit of functionality it shouldn't matter.

thermokarst commented 4 years ago

are these artifacts required to have MANIFEST files?

Yes! Fortunately, they aren't really something you need to think too hard about, if you import as a CasavaOneEightSingleLanePerSampleDirFmt for example, the transformers will make the manifest for you. Every SampleData[SequencesWithQuality | PairedEndSequencesWithQuality] will have a manifest available in its "base format".

Because I was able to create a SingleLanePerSamplePairedEndFastqDirFmt for testing without a MANIFEST.

You've been here before, remember? We ran into some issues where you weren't calling validate on some test data formats, which lead to some funky test issues down the line. Remember, in QIIME 2 unit tests, you'll need to validate manually, unless you are somehow invoking the test through the framework (which will validate for you).

Also if it's alright, I'm just going to change the ids to 130 and 130.2 do we don't have the whole no _ in sample ids issue mentioned in the original PR message.

The underscores tie directly into the original issue, though: "Certain ID schemes can become mixed up with barcodes during sorting" - the sample ID and barcode are _-separated - how about you add tests for both cases?

Oddant1 commented 4 years ago

You've been here before, remember? We ran into some issues where you weren't calling validate on some test data formats, which lead to some funky test issues down the line. Remember, in QIIME 2 unit tests, you'll need to validate manually, unless you are somehow invoking the test through the framework (which will validate for you).

Right, I remember that now. Thank you.

The underscores tie directly into the original issue, though: "Certain ID schemes can become mixed up with barcodes during sorting" - the sample ID and barcode are _-separated - how about you add tests for both cases?

I'm trying to remember what the issue was because it's been so long. I think my understanding in the original post on this PR was off. I was thinking the issue didn't have to do specifically with the _ but with the presence of that additional information in the first place. But yeah it seems like it does have to do specifically with the _.

Oddant1 commented 4 years ago

Ok something really goofy is happening when I run dada2 stuff. Locally on this branch I'm getting a million errors from R. The one failure that travis is getting here is what I expected

EDIT: Updated my dev environment and now it isn't being goofy

Oddant1 commented 4 years ago

As of https://github.com/qiime2/q2-dada2/pull/125/commits/41c39e55e061c59c00a0d6e70b6ef861704e4a9c we remove the barcode from the filenames in denoise_paired before duplicating the files which should force them to sort on sample_ids.

Are there any downsides to this?

Additionally, if there are no downsides, do we want to just do it for denoise_single as well? Currently we have to do some kinda funky stuff in _denoise_helper because if we called it from denoise_paired the barcode is already gone, but if we called it from denoise_single the barcode is still there.

thermokarst commented 4 years ago

Hey @Oddant1, there are some code changes in the diff that are either from another PR, or maybe in-resolved changes from master. Either way, can you clean this branch up when you get time, then I'll give it another review. Thanks!

Oddant1 commented 4 years ago

@thermokarst I think that should remove all the stuff from the diff that I didn't do.

lizgehret commented 1 year ago

Hey @ebolyen do we still want to move forward with this one? If so, @Oddant1 can you resolve the merge conflicts?

Oddant1 commented 1 year ago

@lizgehret actually no it turns out these conflicts suck. I'll get it done at some point though. Just not gonna be completely trivial.

Oddant1 commented 1 year ago

@lizgehret Merge issues should be resolved. Not sure what else is gonna need to happen here if anything.