mani2012 / PathoStat

The purpose of this package is to perform Statistical Analysis on the PathoScope generated reports files.

OTU filtering #18

Open mlbendall opened 7 years ago

mlbendall commented 7 years ago

For those not at BU, we had a conversation today about how different analyses have different filtering requirements for the data. For example, you should not filter low-abundance OTUs for alpha diversity calculations, but there are other situations where you might want to filter for analysis or visualization. So we concluded:

  1. The entire raw PathoID reports should be read in and stored
  2. We need a general purpose function for filtering the data. For example, get only the top 10 OTUs, or get all OTUs that account for >1% of the data, or remove OTUs that are only present in one sample.
  3. There will be an intermediate layer that performs this filtering. Functions should assume that they are being handed a properly filtered object.
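
The three filters named in point 2 could be sketched in base R roughly as follows. This is only an illustration under stated assumptions: the counts are held in a plain matrix with OTUs as rows and samples as columns, the values are made up, and the function names `top_n_otus`, `min_fraction_otus` and `min_prevalence_otus` are hypothetical, not part of PathoStat or phyloseq.

```r
# Toy count matrix (made-up values): OTUs as rows, samples as columns.
counts <- matrix(
  c(500, 300,  0,
     40,  60, 20,
      5,   0,  0,
    455, 640, 80),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("OTU", 1:4), paste0("S", 1:3))
)

# Hypothetical helper: keep the n OTUs with the largest total count.
top_n_otus <- function(x, n) {
  ranked <- names(sort(rowSums(x), decreasing = TRUE))
  x[ranked[seq_len(min(n, nrow(x)))], , drop = FALSE]
}

# Hypothetical helper: keep OTUs that account for more than `frac` of all reads.
min_fraction_otus <- function(x, frac) {
  x[rowSums(x) / sum(x) > frac, , drop = FALSE]
}

# Hypothetical helper: keep OTUs present (count > 0) in at least k samples.
min_prevalence_otus <- function(x, k) {
  x[rowSums(x > 0) >= k, , drop = FALSE]
}

top2       <- top_n_otus(counts, 2)
abundant   <- min_fraction_otus(counts, 0.01)
widespread <- min_prevalence_otus(counts, 2)
```

Because each helper takes a matrix and returns a matrix, the filters can be chained, which is one way the intermediate layer in point 3 might hand downstream functions an already filtered object.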

There are other details that need to be sorted out, such as how to track if users upload pre-filtered data, etc.

ecastron commented 7 years ago

The Santiago team agrees with this 100%!

I'd like to add that while pathoID writes a sorted .tsv file, it's sorted by Final Guess and sometimes you want it sorted by Final Read Numbers. If we read the full pathoID output without any cutoffs, then in phyloseq you can easily get the top X by issuing something like:

top10 <- names(sort(taxa_sums(physeq), TRUE)[1:10])

Someone may want to define the top X by proportions instead of counts, in which case a transformation is needed:

physeq <- transform_sample_counts(physeq, function(x) x / sum(x) )
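
To illustrate why the choice between counts and proportions matters, here is a small base-R sketch that does not need phyloseq. It assumes a plain matrix with OTUs as rows and samples as columns and uses made-up numbers; dividing each column by its total is meant to mirror what the transformation above does per sample.

```r
# Toy data (made-up): one deeply sequenced sample and two shallow ones.
counts <- matrix(
  c(9000, 10,  5,
    1000, 10, 15,
       0, 80, 80),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("OTU_A", "OTU_B", "OTU_C"), c("S1", "S2", "S3"))
)

# Convert each sample (column) to proportions that sum to 1.
# Note: a sample with zero total reads would produce NaN here.
props <- sweep(counts, 2, colSums(counts), "/")

# Top 2 OTUs ranked by summed raw counts versus summed proportions.
top_by_count <- names(sort(rowSums(counts), decreasing = TRUE))[1:2]
top_by_prop  <- names(sort(rowSums(props),  decreasing = TRUE))[1:2]
```

In this toy case the deep sample dominates the count-based ranking, while the proportion-based ranking favors the OTU that is abundant in most samples, so the two "top X" lists differ.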

Regarding point 3, I think users should be warned to upload unfiltered results only, and let PathoStat decide when it's appropriate to filter.

BTW, @mlosada323 mentioned rarefaction for 16S data. That's also a one-liner in phyloseq:

physeq_rare <- rarefy_even_depth(physeq, sample.size = 1000, replace = FALSE, rngseed = TRUE)
physeq_rare
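
For readers who want to see what rarefying to an even depth does, here is a rough base-R sketch of subsampling without replacement. It is not phyloseq's implementation; `rarefy_counts` is a hypothetical helper, the data are made up, and the matrix is assumed to have OTUs as rows and samples as columns. Samples with fewer reads than the target depth are dropped.

```r
# Hypothetical helper: subsample every sample (column) to `depth` reads
# without replacement, dropping samples that have fewer reads than `depth`.
rarefy_counts <- function(x, depth, seed = 1) {
  set.seed(seed)  # fixed seed so the subsample is reproducible
  x <- x[, colSums(x) >= depth, drop = FALSE]
  out <- vapply(seq_len(ncol(x)), function(j) {
    # Expand counts into one entry per read, labelled by OTU row index.
    reads <- rep(seq_len(nrow(x)), times = x[, j])
    # Draw `depth` reads without replacement and count them per OTU.
    picked <- reads[sample.int(length(reads), depth)]
    tabulate(picked, nbins = nrow(x))
  }, integer(nrow(x)))
  matrix(out, nrow = nrow(x), dimnames = dimnames(x))
}

# Toy data (made-up): S3 has only 50 reads and is dropped at depth 100.
counts <- matrix(
  c(600, 100, 30,
    300, 350, 20,
    100,  50,  0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("OTU", 1:3), c("S1", "S2", "S3"))
)

rare <- rarefy_counts(counts, depth = 100)
```

After this call every remaining sample has exactly 100 reads, and no OTU count exceeds its original value.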

Cheers,

Eduardo

PS: The alluvial plot is almost done! @Sanrrone

[Image: screenshot, 2016-07-20 18:08:51]

mlbendall commented 7 years ago

Wow looks nice @Sanrrone!

mlbendall commented 7 years ago

Can you make a remote branch and push up what you have currently? I'd like to look at how you are getting the sample condition.

Sanrrone commented 7 years ago

I'm confused about how remote branches work. I made a pull request; is that the same thing? Either way, you can see the changes in my fork: https://github.com/Sanrrone/PathoStat

mlbendall commented 7 years ago

Oh, didn't know you were working on a fork.

A remote branch lives in the same repository, while a fork creates a new repository. There is currently debate about when to branch and when to fork, but it boils down to how closely you are involved with the original project and whether your changes will eventually be incorporated into it.

Just make sure to keep your fork in sync with master, and (ideally) merge the upstream master and test your code before making a pull request. Same goes for branches.