superphy / spfy

Spfy: an integrated graph database for real-time prediction of Escherichia coli phenotypes and downstream comparative analyses
https://lfz.corefacility.ca/superphy/grouch/
Apache License 2.0

add all metadata in enterobase to blazegraph #218

Closed kevinkle closed 6 years ago

kevinkle commented 7 years ago

This can be seen as a follow-up to https://github.com/superphy/backend/issues/210

kevinkle commented 7 years ago

For future reference, in case we ever need it:

ubcsamsung [5:32 PM] 
Hi Kevin

kevin [5:33 PM] 
Howdy

ubcsamsung [5:33 PM] 
Could you tell me how you get the filename when you download the genome data from enterobase?

[5:33] 
There seems to be some mismatch

kevin
[5:34 PM] 
Hmm

[5:34] 
so we take the `barcode` value under the `experiment` dictionary as the filename

[5:35] 
(after appending `.fasta`)

[5:35] 
There are checks for files which aren’t assembled or are not found on enterobase

[5:35] 
Namely lines `52` and `9`

[5:36] 
can you elaborate on the mismatch?

ubcsamsung [5:37 PM] 
We are trying to match the serotype data from enterobase to the genome we have

[5:37] 
For example

[5:38] 
Hmm give me a sec

kevin
[5:38 PM] 
If I'm not mistaken

[5:39] 
this would be related to the difference between the `barcode` name in the `experiment` dicts vs the `strains` dict

ubcsamsung [5:39 PM] 
yes

kevin
[5:39 PM] 
for example, under `strains`, this might be `ESC_AA7740AA`, and `ESC_CA1647AA_AS` under `experiment`

ubcsamsung [5:40 PM] 
there seems to be some difference between assembly barcode and just barcode

kevin
[5:41 PM] 
try backtracing the `barcode` in `experiment` to its `id`

[5:42] 
this should give you a match to the row in `strains`

ubcsamsung [5:42 PM] 
where is the experiment file?

kevin
[5:42 PM] 
i.e., barcode `ESC_AA7740AA` and `ESC_CA1647AA_AS` both use id `7740`

[5:43] 
you can start up a Python interpreter (or use a script, if you’d like) and run
```
r = requests.post('http://enterobase.warwick.ac.uk/get_data_for_experiment', data=options)
d = r.json()
# d.keys()
# [u'strains', u'experiment']
strains = d['strains']
experiment = d['experiment']
```

[5:43] 
after running `import requests`, ofc

ubcsamsung [5:45 PM] 
options is?

kevin
[5:45 PM] 
ah right sorry

[5:45] 
this is from https://github.com/superphy/backend/blob/master/scripts/enterobase.py

[5:45] 
where
```
options = {
    'no_legacy': 'true',
    'experiment': 'assembly_stats',
    'database': 'ecoli',
    'strain_query_type': 'query',
    'strain_query': 'all'
}
```

[5:46] 
just mimics the behavior of the `GET` request
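
A minimal sketch of the id-based matching described in the chat above. It assumes that `strains` and `experiment` are each a list of dicts carrying `id` and `barcode` keys; the actual shape of the enterobase response is not shown in this thread, so treat the record layout, the `serotype` field, and the helper name `match_assemblies` as hypothetical. The sample records only reuse the two barcodes and the id quoted in the chat.

```python
# Hypothetical sample records standing in for d['strains'] and d['experiment'].
# Only the barcodes and the id come from the chat; the rest is illustrative.
strains = [
    {"id": 7740, "barcode": "ESC_AA7740AA", "serotype": "placeholder"},
]
experiment = [
    {"id": 7740, "barcode": "ESC_CA1647AA_AS"},
    {"id": 9999, "barcode": "ESC_ZZ0000AA_AS"},  # no matching strain row
]


def match_assemblies(strains, experiment):
    """Map each downloaded filename to its row in `strains` via the shared id."""
    strains_by_id = {row["id"]: row for row in strains}
    matches = {}
    for exp in experiment:
        strain = strains_by_id.get(exp["id"])
        if strain is None:
            # Skip assemblies that have no corresponding strain record.
            continue
        # The filename is the experiment barcode with `.fasta` appended.
        matches[exp["barcode"] + ".fasta"] = strain
    return matches


matched = match_assemblies(strains, experiment)
for filename, strain in matched.items():
    print(filename, "->", strain["barcode"])
```

Keying the lookup on `id` rather than on either barcode avoids the mismatch discussed above, since the assembly barcode and the strain barcode differ while the id is shared.
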
kevinkle commented 6 years ago

Tests are passing as of https://github.com/superphy/backend/commit/e6aa5b75dd322b12bd338bffb55592ec23fb239a

kevinkle commented 6 years ago

Example of the metadata file expected is provided in https://github.com/superphy/backend/blob/218-metadata/app/tests/example_metadata.xlsx

Will test via reactapp now.

kevinkle commented 6 years ago

PR in https://github.com/superphy/backend/pull/251

kevinkle commented 6 years ago

Merged, closing issue.