ncbi / ngs-tools


Way to access NCBI STAT data in bulk #11

Open Jalapenobadger opened 4 years ago

Jalapenobadger commented 4 years ago

Hi, I'm wondering whether there is any way to access the taxonomic data that STAT automatically generates for each NCBI run. Every metagenomic upload on the SRA has this analysis generated and displayed as a Krona chart, but is there a route by which we could download the data in plain-text form, for example to experiment with association rule mining?

Also, is there a roadmap or website, besides GitHub, dedicated to this project? Is there anywhere people can find more information about STAT, such as who works on it or what your future goals for it might be?

Thanks! -Pete

Jalapenobadger commented 3 years ago

I don't know if this will be helpful for you, but I eventually found the answer to my question by contacting the NCBI help email.

The SRA does not store this sort of data itself; it is contracted out to cloud services, but you can access it freely. They sent me these links:

https://www.ncbi.nlm.nih.gov/sra/docs/sra-bigquery/
https://www.ncbi.nlm.nih.gov/sra/docs/sra-bigquery-examples/
https://www.ncbi.nlm.nih.gov/sra/docs/sra-cloud-based-examples/

By setting up a BigQuery free sandbox account, I have been able to access all of the raw outputs as I was hoping to; that is, for any run you can get the list of taxonomic names being generated. So you can set up a search that selects from the metadata table only the entries containing bat coronavirus, and then use the accession ID to cross-reference the taxonomy information, or whatever else you might want to do.
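The cross-referencing step described above could be sketched roughly as follows. This is only an illustration under some assumptions: the taxonomy table name is the one quoted later in this thread, while the metadata table name (`nih-sra-datastore.sra.metadata`), the `organism` column, and the helper names `build_crossref_query` and `run_crossref` are assumptions or made up for the example, so check them against the linked NCBI pages. Running it needs the `google-cloud-bigquery` package and Google Cloud credentials.

```python
# Table names: TAX_TABLE is quoted in this thread; METADATA_TABLE and the
# `organism` column are assumptions to verify against the NCBI docs.
METADATA_TABLE = "nih-sra-datastore.sra.metadata"
TAX_TABLE = "nih-sra-datastore.sra_tax_analysis_tool.tax_analysis"


def build_crossref_query() -> str:
    """Build SQL that joins run metadata to STAT taxonomy rows on the accession."""
    return (
        "SELECT m.acc, t.name\n"
        f"FROM `{METADATA_TABLE}` AS m\n"
        f"JOIN `{TAX_TABLE}` AS t ON t.acc = m.acc\n"
        # The search pattern is passed as a query parameter, not pasted into the SQL.
        "WHERE LOWER(m.organism) LIKE @organism_pattern"
    )


def run_crossref(organism_pattern: str = "%bat coronavirus%"):
    """Run the join and return (accession, taxon name) pairs."""
    # Imported here so the query builder can be used without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter(
                "organism_pattern", "STRING", organism_pattern
            )
        ]
    )
    rows = client.query(build_crossref_query(), job_config=job_config).result()
    return [(row["acc"], row["name"]) for row in rows]


if __name__ == "__main__":
    # Print the SQL so it can also be pasted into the BigQuery console.
    print(build_crossref_query())
```

A join like this may scan a lot of data, so it is probably worth checking the console's estimate of bytes processed before running it on a free account.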

I hope this helps, -Rocky Whitesell

On Thu, May 20, 2021 at 4:39 PM babarlelephant @.***> wrote:

Same question: is there any way to find the list of SRA runs containing bat coronavirus in the taxonomy? Some people have shown that, even when a hit looks like obvious nonsense, checking the few viral reads can lead to interesting results and even new genome assemblies, at least for high-quality runs.


babarlelephant commented 3 years ago

Thanks a lot @Jalapenobadger. I was able to get all the accessions that mention Coronaviridae in the taxonomy analysis (the full analysis is visible in the HTML source code; the analysis tab of https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR2063951 shows only the best matches).

I created a Gmail account (I had to enter my phone number), then in https://console.cloud.google.com/bigquery I ran:

SELECT acc FROM `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` WHERE name = "Coronaviridae"

I saved the output as "local csv", which gave 16000 results. To obtain all 229293 results, I used "save on google drive".

Be aware that this interface is limited for free accounts, unless you enter a credit card number and get $300 in free credits.
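For anyone who would rather script this than use the console's save options, here is a rough sketch of the same query run from Python, with every accession written to a CSV file. It assumes the `google-cloud-bigquery` package is installed and Google Cloud credentials are configured; the function names `write_accessions` and `export_taxon` and the output file name are made up for the example.

```python
import csv

# Table name as used in the query quoted in this thread.
TAX_TABLE = "nih-sra-datastore.sra_tax_analysis_tool.tax_analysis"
# The taxon name is passed as a query parameter (@taxon).
QUERY = f"SELECT acc FROM `{TAX_TABLE}` WHERE name = @taxon"


def write_accessions(rows, path: str) -> int:
    """Write the `acc` field of each row to a one-column CSV; return the row count."""
    count = 0
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["acc"])
        for row in rows:
            writer.writerow([row["acc"]])
            count += 1
    return count


def export_taxon(taxon: str = "Coronaviridae",
                 path: str = "coronaviridae_accessions.csv") -> int:
    """Query BigQuery for runs whose STAT output names `taxon` and save them."""
    # Imported here so write_accessions can be used without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("taxon", "STRING", taxon)]
    )
    # result() waits for the job and iterates over all rows, page by page.
    rows = client.query(QUERY, job_config=job_config).result()
    return write_accessions(rows, path)
```

Calling `export_taxon()` would then write the accessions to `coronaviridae_accessions.csv` in the current directory.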

Jalapenobadger commented 3 years ago

Hey, glad I could help. I think there's a whole lot of potential being overlooked in these databases; I wish they were more widely known.
