markziemann / dee2

Digital Expression Explorer 2 (DEE2): a repository of uniformly processed RNA-seq data
http://dee2.io
GNU General Public License v3.0

Nicer web UI for searching #80

Open markziemann opened 4 years ago

jamespeterschinner commented 3 years ago

Hey Mark,

I've added a new branch. It'd be great if you're able to set up the development environment as per the README; that will no doubt highlight any configuration or setup steps I may have missed.

markziemann commented 3 years ago

I will set you up with a similar devel environment tomorrow. For now, are you able to ssh into prod and look around?

jamespeterschinner commented 3 years ago

Hi Mark,

Yes, I took a look around (worked perfectly, thank you). I noticed the Apache config is a little different from what I'm used to; it seems like this is a Debian quirk? Do you think we should include these files on GitHub?

Regarding setting up the dev environment, I'd hoped you would be able to build the front-end code locally, allowing you to monitor progress. It would also be a good test of whether I've included all the information needed to get it set up.

The Elasticsearch (ES) component is going well: I've added search-as-you-type suggestions and I'm getting good search results. This seems great for exploring the data set without prior knowledge; however, from reading the ES documentation, I don't think search results are guaranteed to be exhaustive. It's almost like we need a two-stage search: one to get quick suggestions and preliminary results, and another to comprehensively find everything (which the current search does!). Something to discuss on Wednesday.
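As a rough sketch, the two stages could be two different query bodies against a hypothetical `studies` index: stage 1 uses a `search_as_you_type` field for fast `bool_prefix` suggestions, stage 2 a plain `match` query with `track_total_hits` so the hit count isn't capped at 10,000. The index and field names here are illustrative assumptions, not the actual mapping.

```python
def suggest_query(text: str, size: int = 10) -> dict:
    """Stage 1: fast prefix suggestions via a search_as_you_type field.
    Uses the auto-generated _2gram/_3gram subfields of 'title'."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": text,
                "type": "bool_prefix",
                "fields": ["title", "title._2gram", "title._3gram"],
            }
        },
    }

def exhaustive_query(text: str) -> dict:
    """Stage 2: comprehensive search; track_total_hits asks ES to count
    every matching document instead of stopping at 10,000."""
    return {
        "track_total_hits": True,
        "query": {"match": {"abstract": {"query": text}}},
    }

if __name__ == "__main__":
    body = suggest_query("legionella")
    print(body["query"]["multi_match"]["type"])
```

Either body would be sent to the usual `_search` endpoint; only the suggestion query needs the special field mapping.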

jamespeterschinner commented 3 years ago

It currently looks something like this:

[screenshot of the search UI]

markziemann commented 3 years ago

I will see if I can obtain the study level metadata using the approach in https://github.com/markziemann/dee2/issues/84

I used the reutils R package to get the metadata from NCBI and it can be provided as XML or JSON https://github.com/markziemann/dee2/commit/ca254ca258b2c76d21b8c80472faa9e78a96183a

jamespeterschinner commented 3 years ago

That's great, were you able to download all the data? I believe there are 21,742 studies currently processed in dee2 out of the ~270,000 on NCBI (according to BigQuery).

My feeling is that this would get us up and running quickly, but I'm not sure whether E-utilities will be able to process all the requests in a timely manner. When I had a go, I used a naive approach requesting chunks of 200 studies at a time at ~9 requests per second (with an API key), which resulted in being disconnected from the server. The E-utilities documentation does suggest that you can upload many (potentially thousands of) IDs using the POST endpoint (https://www.ncbi.nlm.nih.gov/books/NBK25497/), and maybe that method would be more appropriate.
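A minimal sketch of that batched-POST idea: split the accession list into large batches and build one `epost` form payload per batch, so far fewer requests are needed than with 200-ID chunks. The helper names and batch size are my own; `db` and `id` are the documented `epost` parameters.

```python
from typing import Iterator, List

# Documented epost endpoint; each payload below would be POSTed here.
EPOST_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi"

def chunked(ids: List[str], size: int) -> Iterator[List[str]]:
    """Split an accession list into POST-sized batches."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

def epost_payloads(ids: List[str], db: str = "sra",
                   size: int = 5000) -> List[dict]:
    """Build one form payload per batch. Larger batches mean fewer
    requests, which should stay well under NCBI's rate limit
    (~10 requests/second with an API key)."""
    return [{"db": db, "id": ",".join(batch)} for batch in chunked(ids, size)]
```

With 21,742 studies and 5,000 IDs per POST, that is only five requests, versus ~109 requests of 200 IDs each.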

There is also the SRR metadata to consider.

The metadata downloadable from NCBI (ftp://ftp.ncbi.nlm.nih.gov/sra/reports/Metadata/) and the E-utilities data naturally lend themselves to a relational data store. While it would be a bit of a pain to implement, storing the bulk data in a SQL database and only indexing 'searchable' fields in Elasticsearch would be a more complete solution. If we managed this, it would also allow a parametric search page such as this example: https://www.digikey.com.au/products/en/integrated-circuits-ics/embedded-system-on-chip-soc/777?k=soc (this would be scope creep)
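As a sketch of what that relational layout could look like, mirroring the SRA hierarchy (study → experiment → sample); table and column names are guesses based on the accession types discussed above, and only the 'searchable' columns (title, abstract, species) would be pushed into Elasticsearch:

```python
import sqlite3

# In-memory database just to illustrate the schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE studies (
    srp TEXT PRIMARY KEY,
    title TEXT,
    abstract TEXT          -- indexed in Elasticsearch
);
CREATE TABLE experiments (
    srx TEXT PRIMARY KEY,
    srp TEXT REFERENCES studies(srp),
    srs TEXT
);
CREATE TABLE samples (
    srs TEXT PRIMARY KEY,
    species TEXT,          -- indexed in Elasticsearch
    insdc_status TEXT      -- kept in SQL for filtering
);
""")
conn.execute("INSERT INTO studies VALUES ('SRP000001', 'demo', 'demo abstract')")
rows = conn.execute("SELECT srp, title FROM studies").fetchall()
print(rows)
```

Everything else (status flags, cross-references) stays in SQL, where joins and parametric filters are cheap.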

markziemann commented 3 years ago

Yes, this 2-step approach was able to get >10k GEO records so I think it will work for SRA metadata as well.

ESEARCH_RES <- esearch(term = MY_GEO_QUERY_TERM, db = "gds", rettype = "uilist",
  retmode = "xml", retstart = 0, retmax = 5000000, usehistory = TRUE)

ESUMMARY <- esummary(ESEARCH_RES)

GSE <- paste("GSE", ESUMMARY$xmlValue("//GSE"), sep = "")
GSM <- ESUMMARY$xmlValue("//Accession")

jamespeterschinner commented 3 years ago

Hey Mark,

Did you manage to get the study abstracts using E-UTILS?

I've just created a new repository aimed at solving this particular issue (and broader contexts): https://github.com/jamespeterschinner/sra_metadata_parser

At the moment the code is slapped together; I tried pretty much every parsing library Rust has to offer before settling on quick_xml.

This program was able to extract all experiment accessions (SRX, SRP, SRS) and the studies, including title and abstract, on my laptop in about 11 min, 2 min of which was file I/O.

time ./target/release/sra_metadata_pipeline -f NCBI_SRA_Metadata_Full_20201006.tar.gz -d ./

real    11m26.250s
user    5m25.738s
sys     0m17.287s

Generating

-rwxrwxrwx 1 james james 300M Dec 20 01:15 experiments.csv
-rwxrwxrwx 1 james james 139M Dec 20 01:15 studies.csv

Which look like:

[screenshot of experiments.csv]

and

[screenshot of studies.csv]

markziemann commented 3 years ago

Much appreciated James.

Do you think you could filter the results for transcriptomic studies that are publicly available (not restricted access)? Then we'd have to mark each study as complete or incomplete somehow.

My attempt with the R-based E-utilities got most of the SRA information but not the abstract... Then I searched the web and found this, which is close to what we want: https://www.biostars.org/p/417189/

For example, it looks like we can get the following to work for E. coli:

esearch -db sra -query '"Escherichia coli"[orgn:__txid562] AND "transcriptomic"[Source] AND "public"[Access]' | esummary -format native | xtract -pattern STUDY_ABSTRACT -element STUDY_ABSTRACT
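The xtract step at the end pulls STUDY_ABSTRACT elements out of the returned XML. If Entrez Direct isn't available, the same extraction can be done from the raw XML with the standard library; the fragment below is a made-up stand-in for the real records, which nest STUDY_ABSTRACT inside the study's DESCRIPTOR block:

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for an SRA XML response (structure simplified).
xml_doc = """
<EXPERIMENT_PACKAGE_SET>
  <EXPERIMENT_PACKAGE>
    <STUDY accession="SRP000001">
      <DESCRIPTOR>
        <STUDY_TITLE>Demo study</STUDY_TITLE>
        <STUDY_ABSTRACT>An example abstract.</STUDY_ABSTRACT>
      </DESCRIPTOR>
    </STUDY>
  </EXPERIMENT_PACKAGE>
</EXPERIMENT_PACKAGE_SET>
"""

root = ET.fromstring(xml_doc)
# iter() walks the whole tree, so nesting depth doesn't matter --
# the same idea as xtract's -pattern/-element pair.
abstracts = [el.text for el in root.iter("STUDY_ABSTRACT")]
print(abstracts)
```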

jamespeterschinner commented 3 years ago

No problem Mark,

Yeah, I think that should be a pretty straightforward task. I want to include all the run metadata in this as well.

I'm of two minds about how to proceed. I think it would be ideal to load all this data into a SQL database (SQLite) and then query and tag relevant data during Elasticsearch indexing, or alternatively to perform the filtering and tagging straight from the generated CSV files.

The first option would be more 'correct', but it would add another layer of complexity (and could enable more advanced searches). The second option could probably be implemented with a simple R/Python script.

Maybe E-utilities could be used to populate tooltips with more info on the web page?

Provided the metadata dumps continue to be published, I think having a 'single source of truth' would simplify matters. I'm not sure how much extra data would be generated by including the SRR data. So far we are at 440 MB with just a cross-reference table and the abstracts. Potentially there could be gigabytes of data, which could become a challenge to download using HTTP requests alone.

UPDATE: The public status of a record is located in the *sample.xml files, which means we'd have to use the data relations to do the filtering. So we should just load all this into SQLite and then perform the query.

jamespeterschinner commented 3 years ago

I added the extraction code for the samples' SRS, species, description and status.

Running

time ./target/release/sra_metadata_parser -f NCBI_SRA_Metadata_Full_20201006.tar.gz -d ./data

real    17m25.726s
user    7m37.005s
sys     0m36.466s

Files

-rwxrwxrwx 1 james james 300M Dec 21 18:36 experiments.csv
-rwxrwxrwx 1 james james 609M Dec 21 18:36 samples.csv
-rwxrwxrwx 1 james james 139M Dec 21 18:36 studies.csv

samples.csv snip

[screenshot of samples.csv]

UPDATE

I loaded these files into SQLite (which took >30 min) and performed the following query, which seems to be a solution:

sqlite> select * from studies
   ...>   inner join experiments on studies.srp = experiments.srp
   ...>   inner join samples on samples.srs = experiments.srs
   ...>   where insdc_status = "public" limit 1;

ERP023552,leg_outbreak,"Genomic investigation of a suspected outbreak of Legionella pneumophila ST82 reveals undetected heterogeneity by the present gold-standard methods, Denmark, July to November 2014","Between July and November 2014, 15 community-acquired cases of Legionnaires' disease (LD), including four with Legionella pneumophila serogroup 1 sequence type (ST) 82, were diagnosed in Northern Zealand, Denmark. An outbreak was suspected. No ST82 isolates were found in environmental samples and no external source was established. Four putative-outbreak ST82 isolates were retrospectively subjected to whole genome sequencing (WGS), followed by phylogenetic analyses with epidemiologically unrelated ST82 sequences. The four putative-outbreak ST82 sequences fell into two clades, separated by ca. 3,650 single nucleotide polymorphisms (SNPs) when recombination regions were included but only by 12 to 21 SNPs when these were removed. A single putative-outbreak ST82 isolate sequence segregated in the first clade. The other three clustered in the second clade, where all included sequences had < 5 SNP differences between them. Intriguingly, this clade also comprised epidemiologically unrelated isolate sequences from the United Kingdom and Denmark dating back as early as 2011. The study confirms that recombination plays a major role in L. pneumophila evolution. On the other hand, strains belonging to the same ST can have only few SNP differences despite being sampled over both large timespans and geographic distances. These are two important factors to consider in outbreak investigations.",ERX2068933,ERP023552,ERS1789631,ERS1789631,"Legionella pneumophila","",public
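On the >30 min load time: batching the inserts inside a single transaction is the usual fix for slow bulk loads in SQLite, since autocommit otherwise forces a sync per row. A minimal sketch (column names assumed, tiny in-memory CSV standing in for the real files):

```python
import csv
import io
import sqlite3

# Stand-in for studies.csv; the real file is ~139 MB.
studies_csv = io.StringIO("srp,title\nERP023552,leg_outbreak\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE studies (srp TEXT PRIMARY KEY, title TEXT)")

reader = csv.reader(studies_csv)
next(reader)  # skip the header row

# 'with conn:' wraps the whole load in one transaction, so SQLite
# commits once at the end instead of once per inserted row.
with conn:
    conn.executemany("INSERT INTO studies VALUES (?, ?)", reader)

print(conn.execute("SELECT COUNT(*) FROM studies").fetchone()[0])
```

The same pattern applies to experiments.csv and samples.csv; together with `PRAGMA journal_mode = WAL` it should bring the load well under 30 minutes.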