jordenrabasco / Long_read_processing_tutorial


[Section] Dereplication and sanity check #8

Open jordenrabasco opened 2 years ago

jordenrabasco commented 2 years ago

This is the introduction to the dada2 method and an explanation and sanity check of the dereplication. Let me know what you think!

https://github.com/jordenrabasco/Long_read_processing_tutorial/blob/afa1a962b305b79b0473a644cd9133a992bfa9ea/long%20read%20Tutorial.Rmd#L121-L127

benjjneb commented 2 years ago

"If for any reason you need to stop the tutorial, the R object can then be loaded in and the workflow continued from this step."

What does this mean exactly? And, would a tutorial reader know how to perform this action?

"As we can see the list of unique sequences and their counts were generated. We can also see that there is a significant relative abundance of these unique sequences. This allows the 'learnerrors' module to run appropriately. If there wasn't enough abundance for each unique amplicon then the error model wouldn't run correctly. However, this doesn't seem to be the case here and we can therefore assume that the dereplicaiton procedure was a success! If you wish to check the other samples you can switch the sample name in the code 'head(drp$R11_1_P3C3.fastq.gz$uniques)' to whichever sample you would like to view."

This needs to be substantially revised.

jordenrabasco commented 2 years ago

"If for any reason you need to stop the tutorial, the R object can then be loaded in and the workflow continued from this step."

I think this was left over from before I split the dereplication and error plot sections. I have since moved that sentence to the error plot section of the tutorial.

jordenrabasco commented 2 years ago

"This needs to be substantially revised."

I have rewritten the section substantially, adding a description of the R object generated by the dereplication function and how it is structured. I also provide instructions on how to investigate the object further. I removed the part of this section that discussed the error modules, to avoid confusion and to separate the sections more fully. Let me know what you think of the new version! It should now be submitted in its own git push.

benjjneb commented 2 years ago

Can you link the commit?

Also, a useful feature is that you can include issue numbers in commit messages, and the commit will be linked automatically into that issue. E.g., if you had included "addresses #8" in your commit message, a link to that commit would show up in this issue thread automatically.

jordenrabasco commented 2 years ago

Ah okay, I will do that next time! The commit should be linked here: 22bc885af601d46d5b39f3b08f7a9067461baa25

benjjneb commented 2 years ago

"As we can see the list of unique sequences and their associated counts were generated appropiratly (sic)."

How would a new user know whether the output above shows that unique sequences were generated appropriately? This is important because, for long reads in particular, the dereplication step is an important sanity check that the data is appropriate for DADA2. An explanation of why that is, and of what "appropriate" and "not appropriate for DADA2" outputs look like, is needed here.

"Additionally, if for any reason you need to stop the tutorial, the saved R object can then be loaded in and the workflow continued from this step."

How would one do this?
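One possible answer, sketched in base R: `saveRDS()` writes a single R object to disk and `readRDS()` restores it in a later session. The `drp` object below is a small stand-in with made-up sample names and counts, and the file name `drp.rds` is hypothetical; in the tutorial, `drp` would be the output of the dereplication step.

```r
# Stand-in for the dereplication output. In the real workflow `drp` would
# come from the derepFastq step; the sample name and counts are made up.
drp <- list(
  sampleA = list(uniques = c(ACGTACGT = 40L, ACGTACGA = 7L, TCGTACGT = 1L))
)

# Save the object at the end of this step (hypothetical file location).
rds_path <- file.path(tempdir(), "drp.rds")
saveRDS(drp, rds_path)

# Later, in a fresh R session: reload the object and continue from here.
drp_restored <- readRDS(rds_path)

# Check that the restored object matches what was saved.
identical(drp, drp_restored)
```

In practice the file would go somewhere persistent rather than `tempdir()`, which is cleared when the R session ends.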

jordenrabasco commented 2 years ago

Regarding "outputs ... needed here": do you mean just example tables or something more substantial?

benjjneb commented 2 years ago

"do you mean just example tables or something more substantial?"

Example outputs probably aren't even needed. What is needed is a description of how to interpret the output of dereplication, one that makes it clear when the output suggests things are OK and when it suggests they are not.

jordenrabasco commented 2 years ago

Okay, cool. The updated commit linked above should have those changes.

benjjneb commented 2 years ago

I don't think a user could apply that text to their own data. Let's say someone without familiarity with DADA2 runs the derepFastq step and gets 1000 unique sequences out of their 1024 reads. What does that mean? Can they decipher what that means from the current text?
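The scenario above can be made concrete with a short base R sketch. It assumes that the `$uniques` element of a dereplicated sample is a named integer vector of read counts, one entry per unique sequence. The counts and the `seq1`, `seq2`, ... names below are made up to reproduce the 1000-out-of-1024 case; they are not real data.

```r
# Hypothetical counts mimicking the `$uniques` element of one sample:
# 1000 unique sequences from 1024 reads. Names stand in for the sequences.
uniques <- c(14L, 7L, 6L, rep(1L, 997))
names(uniques) <- paste0("seq", seq_along(uniques))

n_reads     <- sum(uniques)         # total reads in the sample
n_unique    <- length(uniques)      # number of distinct sequences
frac_unique <- n_unique / n_reads   # close to 1 means almost no repeated reads

# Fraction of reads whose sequence was observed exactly once.
frac_singleton_reads <- sum(uniques[uniques == 1L]) / n_reads

round(c(reads = n_reads, uniques = n_unique,
        frac_unique = frac_unique,
        frac_singleton_reads = frac_singleton_reads), 3)
```

A rough reading, offered as an assumption and not as an official threshold: when nearly every read is its own unique sequence, as here, there are few repeated observations of any one sequence, which suggests the reads may carry too many errors (or need more filtering) to be appropriate for DADA2. A sample where most reads fall into a modest number of abundant unique sequences is the more reassuring pattern.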

jordenrabasco commented 2 years ago

updated