qiime2 / q2-demux

BSD 3-Clause "New" or "Revised" License
0 stars 20 forks source link

BUG: Improve error message in demux summarize with no data #104

Closed Oddant1 closed 4 years ago

Oddant1 commented 5 years ago

closes #103

Oddant1 commented 5 years ago

Working on adding tests with empty artifacts

Oddant1 commented 5 years ago

Changes in https://github.com/qiime2/q2-demux/pull/104/commits/a68e4f2e2686ebd9713e4da1a9c7ccc66afbefc8 were backtracked because upon further inspection of the code they did nothing.

Oddant1 commented 5 years ago

@thermokarst don't I already have test data for paired end empty_fwd and paired end empty_rev? Is there something wrong with it? Additionally, I am unsure of how to check for part of a visualization in a unittest, but I'll figure it out

thermokarst commented 5 years ago

@thermokarst don't I already have test data for paired end empty_fwd and paired end empty_rev?

Yep!

Is there something wrong with it?

Not to my knowledge. I sent you those files so that you could run the commands yourself.

Manual testing vs unit testing --- I want you to manually test, because I want to make sure you're seeing the big picture w.r.t. the viz.

Additionally, I am unsure of how to check for part of a visualization in a unittest, but I'll figure it out

Sort out the business logic first, using manual testing, unit testing patterns will become clear once you figure out the business logic.

Oddant1 commented 5 years ago

I've been doing a lot of testing, and there are some odd things I don't entirely understand going on here. Using your test files, summarize ends up with empty forward_scores and empty reverse_scores regardless of which test file I use. This causes the interactive_quality_plot to not render in either case, and I'm not entirely sure how to render it given the fact that there are "allegedly" no reads at all.

thermokarst commented 5 years ago

Using your test files, summarize ends up with empty forward_scores and empty reverse_scores regardless of which test file I use.

I noticed that in my testing, too! One thing I haven't looked at yet is how these data files behave in 2019.7, for comparison.

and I'm not entirely sure how to render it given the fact that there are "allegedly" no reads at all.

Perhaps look at how these files behave in 2019.7 (or master), then see if any of this weird behavior was pre-existing, or, if it was somehow accidentally introduced in this PR. Make sense?

ebolyen commented 5 years ago

I was talking to @Oddant1 earlier and we were talking about this situation.

paired-end-demux-empty-fwd.qza has no reads in the forward direction, and one read in the reverse direction. paired-end-demux-empty-rev.qza has one read in the forward direction, and no reads in the reverse direction.

@thermokarst, these seem to be valid, but it seems like the misspaired situation that we see on the forum once in a while.

Should these actually fail to validate? We can't really do anything downstream if the fwd and rev reads don't correspond (and this seems different from both being empty, which is maybe a more reasonable state).

Alternatively, maybe we could just prevent fastq from being empty at all? I don't know what the implications of that would be downstream, but I can't think of a good reason to have an empty fastq, rather I would expect the sample to drop out (which usually isn't an option with the fasta variety which needs to be empty)

thermokarst commented 5 years ago

@thermokarst, these seem to be valid, but it seems like the misspaired situation that we see on the forum once in a while.

Yep, I think we have seen this often enough to put it on the radar.

Should these actually fail to validate?

Not sure, I have definitely wondered about that, though. I don't know if these cases actually represent invalid data, though. I mean, obviously it would be a silly thing to have this, but trimming tools like cutadapt seem prone to outputting results like this. As well, in cutadapt you have the option of saving the trimmed reads, so I could see some kind of "splitting" action here. Certainly that could all be handled with more refined types though, just rattling off some thoughts.

Alternatively, maybe we could just prevent fastq from being empty at all?

That would be nice, I think that we have some technical debt related to optional outputs that might be getting in the way. Right now a few actions in the ecosystem rely on being able to output empty fastq (as opposed to being optional).

thermokarst commented 5 years ago

@Oddant1, following up on an offline discussion I had with @ebolyen here. I am less concerned with the case of missing fwd or rev reads, and more-so interested in us just getting a solid grip on the business logic. I think exploring that business logic through those weird edge cases will be beneficial for getting things locked down. Does that make sense?

Oddant1 commented 5 years ago

@thermokarst what do you mean by business logic? I'm currently interpreting that to mean the logic behind why we might want to accept files/artifacts like these in the first place, but I'm not sure.

thermokarst commented 5 years ago

@thermokarst what do you mean by business logic?

I mean the flow of data through this action. How do we structure, guide, and process data from start to finish? Business logic is related to, but different from the implementation of the code. So, the business logic I am concerned with includes things like "why are we looking at the quality scores to determine our next step," and "why do these badly formatted data inputs produce surprising outputs after passing through the viz." It could be the case that it is a mistake in the implementation, but I think it is better to start with clearing up the business rules first, since that governs how the action should actually behave. Make sense?

Oddant1 commented 5 years ago

What we currently do on master looks something like this

  1. Obtain sequences and number of sequences
  2. Obtain forward scores
  3. Obtain forward stats
  4. Repeat the above 2 steps for reverse if we are paired
  5. Generate the html
  6. Write to the data.jsonp

Notice, that everything aside from this odd little line of code on line 127 assumes forward reads will always exist


https://github.com/qiime2/q2-demux/blob/d5dbcb3591d92fb6beff2d2ee7ba77c081907b71/q2_demux/_summarize/_visualizer.py#L123-L127


Currently if there are no forward reads we break on step 3 and get that "cannot describe an empty dataframe" deal. On my feature branch, the only change I've really made is to explicitly check for the presence of forward reads in the same way as we explicitly check for the presence of reverse reads. That was the intent anyway.

So it's pretty apparent why we currently explode on lack of forward reads. As for why we don't get any forward scores out of either file, it definitely happens on the master branch not just the feature branch. If we run the command with the .qza with an empty forward, we count 0 sequences, but if we use the one with an empty reverse, we do count a single sequence.

It seems as though we lose this sequence when it comes time to map it to a file, and I think I know why.


https://github.com/qiime2/q2-demux/blob/d5dbcb3591d92fb6beff2d2ee7ba77c081907b71/q2_demux/_summarize/_visualizer.py#L152-L164


On line 153, we call the function _link_sample_n_to_file. This next screenshot is what that function does with comments annotating the specific values we will have when the function executes using the .qza that has one forward read and no reverse reads.


Selection_099


So when we hit the sample_map creation on line 157 we get this as our sample map


Selection_100


[forward_file, reverse_file, 0] telling us we have nothing. I have no idea why _link_sample_n_to_file does what it does, but this seems to be the issue with this particular case. It looks like adding more reads MIGHT solve the particular bizarre issue we've run into with it saying there are no reads in this instance, but surely this shouldn't be a thing? Additionally, this particular scenario should only be a thing when we have a sample size of 1 and we happen to have chosen 0 as our subsample. This shouldn't really happen in reality I don't think.

thermokarst commented 4 years ago

Thanks for this, @Oddant1. I am going to put in ~90 minutes of effort on this to see if I can wrap it up - stay tuned.

thermokarst commented 4 years ago

Okay @Oddant1, I made a bunch of pretty broad tweaks in 8f68dc7, let me know if you have any questions. I left a TODO list in that changeset, I hope that will help provide some guidance to run this across this finish line.

Oddant1 commented 4 years ago

Thanks @thermokarst, sorry I didn't see this earlier. I was tackling a lot of conda/git/computer related issues for most of today, but they should all be resolved now. I've pulled down your commit, done a first pass through the code, and identified where all the tests are failing and why. I'll get to addressing those todos tomorrow.

Oddant1 commented 4 years ago

Thinking about https://github.com/qiime2/q2-demux/pull/104/commits/68983e4896d67c0d65731c3a0434404b8fe353ec a bit more, I'm not sure this is actually what needs to happen here.I have a different idea coming next commit

Oddant1 commented 4 years ago

This last test is probably going to be significantly more of a pain to fix. Multiple files are being used with the same sample-id. This is failing when we attempt to transform the manifest to a dataframe because the last step of that transformation wants to pivot the dataframe so the sampleid column is the index, but all three sampleids are "sample1" causing it to break. This may be solvable using pivot_table instead of pivot in the transfromer. I'll test it and see how it goes.

I also considered the possibility of making the filename an iterable of filenames instead, but this will probably need to be done in all cases (so when there is only one file name) and would probably also need to happen in the transformer. I am open to other suggestions.

The test in question: https://github.com/Oddant1/q2-demux/blob/74c64f45b2e7fcc90bb74d3a7d3399c040e3b551/q2_demux/tests/test_demux.py#L922-L961

The transformer in question (note, it is in q2-types): https://github.com/qiime2/q2-types/blob/004403d689f0aa62c54f5041078883773600e304/q2_types/per_sample_sequences/_transformer.py#L390-L400

Reference to pivot_table method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot_table.html#pandas.DataFrame.pivot_table

Edit: Not sure pivot_table is going to do what we need here.

thermokarst commented 4 years ago

The test in question:

Nuke that test (test_single_sample_multiple_files)! It is super specific, and, it shouldn't even be possible to hit this case anymore. I see no need to make your life painful just for this one silly little test (that I wrote three years ago, apparently).

Oddant1 commented 4 years ago

Works for me :joy:

Oddant1 commented 4 years ago

As of https://github.com/qiime2/q2-demux/pull/104/commits/5439727f1380333cdce5237da154dbabc037d2eb all tests pass. I haven't bothered cleaning up some linting errors yet though.

Oddant1 commented 4 years ago

I think the todo list item

- fix "The minimum sequence length identified during subsampling was undefined bases"

was resolved when I made the minSeqLen a dictionary keyed on direction again. The line that reports the minSeqLen is created separately for each direction. Once when plotBoxes is called with the forward data and again when it's called with the reverse data. This is the line responsible for getting the minSeqLen to be put into the note mentioned in the todo.

const minSeqLen = seqProps.minSeqLen[data.direction];

When minSeqLen was just a single int in https://github.com/qiime2/q2-demux/pull/104/commits/8f68dc7e9cf025f35665f0a773e10a7873a764f4 this line would attempt to index into that integer as if it were a dictionary (or I guess a map or something in Javascript? Idk). This was causing minSeqLen to always be undefined. Now the indexing works again.

Oddant1 commented 4 years ago

I don't really know where to go from here. I can't really think of any new tests that obviously need to be added, and I'm unsure of how to go about consolidating the state variables. There is a lot of state there, but it all seems necessary, and none of it is particularly complicated.

thermokarst commented 4 years ago

Thanks @Oddant1, I will take a look at your comments and commits above and let you know if there is anything else. Before I do that though, can you fix the linting errors that travis is reporting? Once that's set, please move this to "In Review" and reassign to me. Thanks!

Oddant1 commented 4 years ago

@thermokarst I cleaned up the non header related linting errors and rebased so the headers are good too. We should pass all tests and have no linting errors on Travis now

thermokarst commented 4 years ago

Thanks @Oddant1! I made some changes to the JS, so before diving in, don't forget to make clean && make). Here are a few laundry-list items:

Let me know if you have any questions - thanks!

Oddant1 commented 4 years ago

Looks like we get nothing at all for the artifact with empty forward reads. It says there should be a plot for reverse reads but there's nothing at all. Screenshot from 2020-05-15 11-15-07

thermokarst commented 4 years ago

Did you remember to run make clean && make, after fetching the latest changes in this branch? This is what the plots look like for me:

empty-fwd

Screen Shot 2020-05-15 at 11 18 24 AM

empty-rev

Screen Shot 2020-05-15 at 11 18 59 AM

Oddant1 commented 4 years ago

I did, but I'll do it again.

Oddant1 commented 4 years ago

It's working now, maybe I accidentally ran it before pulling the latest commit? Dunno

thermokarst commented 4 years ago

/shrug - who knows... glad you got it working!

Oddant1 commented 4 years ago

Ok unfortunately it looks like the seven number summaries did actually change. I must have forgot to pull the commit earlier. This is why I need to not try to do work before drinking coffee :joy:. Looking into potential causes for these discrepancies now while I also look at how to go about adding that placeholder to the viz.

Oddant1 commented 4 years ago

Also worth noting something I don't think I mentioned earlier. The minSeqLen for the reverse read in one of the tests changed from 5 to 3, I think this has to do with the subsampling looking at different indices than it was before. In fact in the case of this test I'm certain of it. When both forward and reverse subsampling were done together we were looking at indices 0 and 1 in both forward and reverse. Now wer're looking at 0 and 1 in forward and 0 and 2 in reverse. The test in question is test_sequence_length_uses_subsample_paired I just changed the value it was looking for once I realized what was causing the issue.

thermokarst commented 4 years ago

This is why I need to not try to do work before drinking coffee

lol! ☕ <- drink up my friend

Looking into potential causes for these discrepancies now while I also look at how to go about adding that placeholder to the viz.

Sweet!

The minSeqLen for the reverse read in one of the tests changed from 5 to 3, I think this has to do with the subsampling looking at different indices than it was before.

This makes sense - I think the old behavior was actually using the global minimum, rather than the direction's minimum length (although I would need to double-check that statement).

Oddant1 commented 4 years ago

@thermokarst https://github.com/qiime2/q2-demux/pull/104/commits/72d5fe56f6f8602812d717f7c3c133da637b3676 takes a similar approach to the placeholder histograms. It basically just makes the empty div for the quality plot take up the amount of space it would take up if it actually had any information and prints there are no reads in whichever direction is empty. Sorry that took a while. This stuff is fairly unknown to me, so it took a while just to figure out where I needed to be making edits.

EDIT: Also, the seven number summaries have some amount of rng in how they select which reads they use in determining the statistics right? Because everytime I rerender the viz the numbers change slightly, so we wouldn't expect a viz I render from the moving pictures data to be identical to the viz in the actual tutorial would we? I'll use qiime2 2020 and this latest commit to render the viz based on the moving pictures data with all the sequences selected and see if they're the same

EDIT2: Yeah when I render the viz using all the sequences with 2020.2 vs this latest commit they are definitely the same. The discrepancies earlier were just the rng involved in selecting sequences.