nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

[Feature Request] Support for querying the SRA metadata using AWS Athena and Google BigQuery #1842

Closed abhi18av closed 1 year ago

abhi18av commented 3 years ago

New feature

The recent collaboration between NCBI and the cloud providers makes the SRA metadata queryable with SQL through AWS Athena, so the entire archive can be searched by its metadata.

Some relevant resources:

https://www.ncbi.nlm.nih.gov/sra/docs/sra-cloud/

https://registry.opendata.aws/ncbi-sra/

https://www.ncbi.nlm.nih.gov/sra/docs/sra-athena-examples/

https://www.youtube.com/playlist?list=PLH-TjWpFfWrt5MNqU7Jvsk73QefO3ADwD

NOTE: The same could be done for GCP (Google BigQuery) as well. For now I haven't created a separate issue for that.

Suggested implementation

I'm sure there is a more elegant approach, but as an initial draft we could implement this in a couple of ways:

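  1. As a separate method, fromNcbi, which accepts a closure-based query for any particular NCBI database.
// Builds the SQL text; ${db} is braced so Groovy does not read it as db.metadata,
// and the organism value is single-quoted as a SQL string literal.
def ncbi_query = { db, orgnsm ->
"""
SELECT *
FROM ${db}.metadata
WHERE organism = '${orgnsm}'
LIMIT 10
"""
}

Channel.fromNcbi ( query: ncbi_query("SRA", "Homo sapiens") )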
  2. Or as a more specialized enhancement of the fromSRA method, which allows a closure-built query to be passed to the query field. For example,
// Same query builder as in option 1: braced ${db} and a quoted organism literal.
def ncbi_query = { db, orgnsm ->
"""
SELECT *
FROM ${db}.metadata
WHERE organism = '${orgnsm}'
LIMIT 10
"""
}

Channel.fromSRA ( query: ncbi_query("SRA", "Homo sapiens") )

Related: https://github.com/nextflow-io/nextflow/issues/1605

abhi18av commented 3 years ago

This could overlap with https://github.com/nextflow-io/nextflow/pull/1611

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

pditommaso commented 3 years ago

This is a good candidate for a Nextflow plugin, along the same lines as nf-sqldb.

abhi18av commented 3 years ago

I'd be happy to give this a shot 👍

abhi18av commented 3 years ago

Update:

The work is currently being done on my fork, https://github.com/abhi18av/nextflow/tree/abhinav/nf-sraql, with BigQuery as the default source.

Once it is presentable, I'll create and link the PR to this repo.

pditommaso commented 3 years ago

Cool! Would you be willing to open a PR so the changes are easier to follow?

abhi18av commented 3 years ago

Absolutely, will make a PR ~by EOD today~ 👍

Initiated the draft PR with the scratch work, happy to receive any feedback.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

abhi18av commented 2 years ago

WIP - not stale.

abhi18av commented 2 years ago

Whoa, this slipped off my radar after the health crisis. @pditommaso, could you confirm whether this is still relevant? If so, I'd be happy to pick it back up and make a push.

pditommaso commented 2 years ago

Not a priority, but surely a nice-to-have. Shouldn't this work via a JDBC database connection? What's missing?

abhi18av commented 2 years ago

I think it is already working for BigQuery, but I still need to handle paging for large result sets.

pditommaso commented 2 years ago

The most useful thing would be an example in the README. Without that, nobody will even know it exists.
