serratus-bio / open-virome

monorepo for data explorer UI and APIs
http://openvirome.com/
GNU Affero General Public License v3.0

Add MWAS interface as a new page on OpenVirome.com #43

Open lukepereira opened 4 months ago

lukepereira commented 4 months ago

I think this feature would be pretty useful and would likely encourage users to return to the site.

If we're concerned about costs, we can allow public calls only for small-to-medium MWAS jobs and keep it unrestricted for us to run internally. It also seems possible that MWAS on the virome could look up values in Declan's pre-computed rfam.

possible plots:

Another thing to explore would be combining the v-enrichment scores with pre-computed values on rfam, i.e. find "important" sOTUs in a virome, map them to a list of rfam/biosamples, then look up the pre-computed p-values in S3.
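A rough sketch of that lookup flow, to make the three steps concrete. Everything here is hypothetical: the function name, the `top_n` cutoff for what counts as "important", and the in-memory dicts standing in for the sOTU-to-biosample mapping and the pre-computed p-values stored in S3.

```python
def lookup_pvalues(enrichment, sotu_to_biosamples, precomputed, top_n=2):
    """Return {(sotu, biosample): p_value} for the top-scoring sOTUs.

    enrichment:         {sotu_id: v-enrichment score}      (hypothetical shape)
    sotu_to_biosamples: {sotu_id: [biosample_id, ...]}     (hypothetical shape)
    precomputed:        {biosample_id: p_value}, a stand-in for results in S3
    """
    # Step 1: pick the "important" sOTUs, assumed here to be the highest scores.
    top = sorted(enrichment, key=enrichment.get, reverse=True)[:top_n]

    hits = {}
    for sotu in top:
        # Step 2: map each sOTU to its biosamples.
        for biosample in sotu_to_biosamples.get(sotu, []):
            # Step 3: keep only biosamples that have a pre-computed value.
            if biosample in precomputed:
                hits[(sotu, biosample)] = precomputed[biosample]
    return hits


enrichment = {"sotu_a": 3.2, "sotu_b": 0.4, "sotu_c": 2.1}
sotu_to_biosamples = {"sotu_a": ["BS1", "BS2"], "sotu_b": ["BS4"], "sotu_c": ["BS3"]}
precomputed = {"BS1": 0.002, "BS3": 0.04, "BS4": 0.5}

hits = lookup_pvalues(enrichment, sotu_to_biosamples, precomputed)
```

In a real version the `precomputed` dict would be replaced by whatever fetches results from S3, and the definition of "important" would need to be agreed on.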

ababaian commented 3 months ago

@lukepereira can you check in with Ethan on what will be needed to get this online? I think he's really close with his lambda implementation.

almosnow commented 3 months ago

[I'll ask Ethan for his GH account to get him here as well]

finesden33 commented 3 months ago

https://github.com/declanlim/mwas

mwas repo (the readme is very outdated)

ababaian commented 2 months ago

@finesden33 Can you update the README? Also, what's the status of getting the MWAS API online?

lukepereira commented 1 week ago

I added a prototype of the MWAS plot since I think it could be useful when combined with disease filters. It currently uses pre-computed MWAS results for virus families, and I found that the results can be fetched fairly quickly using s5cmd. You can view it by clicking 'Advanced' in the Virome section.

[Image: Screen Shot 2024-11-18 at 3 48 18 PM]

In the future, it would be nice to support running different user-defined MWAS jobs using Ethan's lambda. This needs clarifying, but I think we can define the target set as the current query and the background set as all runs in matched bioprojects (?). We also likely want to re-run this workload using Ethan's updated code to resolve some of the bugs listed below (1, 3, 5).
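A minimal sketch of that target/background definition, assuming we have a run-to-bioproject mapping available. The function name and data shapes are hypothetical, and whether the background should include or exclude the target runs is still an open question; here it includes them, matching "all runs in matched bioprojects".

```python
def build_mwas_sets(query_runs, run_to_bioproject):
    """Split runs into an MWAS target set and background set.

    query_runs:        iterable of run IDs returned by the current query
    run_to_bioproject: {run_id: bioproject_id} for all known runs (hypothetical)
    """
    query_runs = set(query_runs)

    # Bioprojects that contain at least one run from the query.
    matched = {
        run_to_bioproject[run] for run in query_runs if run in run_to_bioproject
    }

    # Background: every run in a matched bioproject (target runs included).
    background = {
        run for run, bioproject in run_to_bioproject.items() if bioproject in matched
    }

    # Target: query runs that we can place in a bioproject.
    target = query_runs & background
    return target, background


run_to_bioproject = {
    "RUN1": "PRJ_A",
    "RUN2": "PRJ_A",
    "RUN3": "PRJ_B",
    "RUN4": "PRJ_C",
}
target, background = build_mwas_sets(["RUN1", "RUN3"], run_to_bioproject)
```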

Some limitations of the existing approach:

  1. Biosamples were not tracked in the data. I try to infer the possible biosamples by including any in the bioproject that contain the virus family, but it's annoying to have to click through each biosample when several contain the matching virus family. I also considered linking out to a defined search on NCBI, but it doesn't seem to be possible.
  2. It's not clear to users that the results come from various tests that were run against virus families, and that we're only surfacing the BioProjects that match their query. Users likely expect the MWAS test to be run using their query as the target set.
  3. I found some data with +inf and -inf values in the p-values or fold change. Likely a division by 0 bug. For now I remove all those rows.
  4. I limit results to the top 1000 and limit the number of bioprojects fetched for large queries. We could likely raise these limits if we moved from fetching results from S3 to using a document-based database for metadata.
  5. There are a lot of virus families missing from the pre-computed MWAS data. I'm not sure if this was intentional or something went wrong with the original run.
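For reference, the clean-up described in points 3 and 4 can be sketched roughly as follows. This is not the code in the prototype: the field names `p_value` and `fold_change` are assumptions, and so is ranking "top" results by ascending p-value.

```python
import math


def clean_mwas_rows(rows, limit=1000):
    """Drop rows with non-finite statistics, then keep the `limit` best rows."""
    # math.isfinite rejects +inf, -inf and NaN in either column.
    finite = [
        row
        for row in rows
        if math.isfinite(row["p_value"]) and math.isfinite(row["fold_change"])
    ]
    # Assumed ranking: smallest p-value first.
    finite.sort(key=lambda row: row["p_value"])
    return finite[:limit]


rows = [
    {"bioproject": "PRJ_A", "p_value": 0.01, "fold_change": 2.0},
    {"bioproject": "PRJ_B", "p_value": float("inf"), "fold_change": 1.5},
    {"bioproject": "PRJ_C", "p_value": 0.001, "fold_change": float("-inf")},
    {"bioproject": "PRJ_D", "p_value": 0.0005, "fold_change": 3.1},
]
cleaned = clean_mwas_rows(rows)
```

Dropping the rows hides the suspected division-by-zero bug rather than fixing it, so this only makes sense as a stopgap until the workload is re-run.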