sensein / covid19

A survey protocol for covid19
https://sensein.github.io/covid19/
Apache License 2.0

How to best parse jsonld files? #53

Open danielmlow opened 3 years ago

danielmlow commented 3 years ago

If we just use the json package, we get something like this for a single activity (covid19 questionnaire) of a single submission:


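import json

# input_dir (data folder) and directory (one submission folder) are defined elsewhere
with open(input_dir + directory + '/activity_0.jsonld', 'r') as f:
    data = json.load(f)

# Print every key/value pair of each JSON-LD node in the activity file
for d in data:
    print('\n===================')
    for key, value in d.items():
        print(key, value)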

@context https://raw.githubusercontent.com/ReproNim/reproschema/1.0.0-rc2/contexts/generic
@type reproschema:ResponseActivity
@id uuid:8d84c9d2-d517-4f87-adc7-a28b959dc659
used ['https://raw.githubusercontent.com/ReproNim/reproschema-library/a996c81dd546051f192db03b03da6d8ee8ff6a25/activities/NDA/items/yearOfBirth', 'https://raw.githubusercontent.com/sanuann/covid19/master/activity/covid19/covid19_schema', 'https://raw.githubusercontent.com/sanuann/covid19/master/protocol/Covid19_schema']
inLanguage es
startedAtTime 2021-02-07T02:42:27.283Z
endedAtTime 2021-02-07T02:42:38.013Z
wasAssociatedWith {'version': '0.0.1', 'url': 'https://sensein.github.io/covid19/', '@id': 'https://github.com/ReproNim/reproschema-ui'}
generated uuid:2758c790-002b-4901-83d3-9964bc927bd4
===================
@context https://raw.githubusercontent.com/ReproNim/reproschema/1.0.0-rc2/contexts/generic
@type reproschema:Response
@id uuid:2758c790-002b-4901-83d3-9964bc927bd4
wasAttributedTo {'@id': '7f3da208-4323-4ea0-b1fe-3d9401586be8', 'subject_id': 'es_5'}
isAbout https://raw.githubusercontent.com/ReproNim/reproschema-library/a996c81dd546051f192db03b03da6d8ee8ff6a25/activities/NDA/items/yearOfBirth
value 1983
===================
@context https://raw.githubusercontent.com/ReproNim/reproschema/1.0.0-rc2/contexts/generic
@type reproschema:ResponseActivity
@id uuid:ad82a144-b329-478e-981e-331eae3b69b8
used ['https://raw.githubusercontent.com/sanuann/covid19/master/activity/covid19/items/covid19_clinical_history', 'https://raw.githubusercontent.com/sanuann/covid19/master/activity/covid19/covid19_schema', 'https://raw.githubusercontent.com/sanuann/covid19/master/protocol/Covid19_schema']
inLanguage es
startedAtTime 2021-02-07T02:42:38.013Z
endedAtTime 2021-02-07T02:43:02.546Z
wasAssociatedWith {'version': '0.0.1', 'url': 'https://sensein.github.io/covid19/', '@id': 'https://github.com/ReproNim/reproschema-ui'}
generated uuid:25714edb-6d16-4c8a-99ba-04b1be67e283
===================
@context https://raw.githubusercontent.com/ReproNim/reproschema/1.0.0-rc2/contexts/generic
@type reproschema:Response
@id uuid:25714edb-6d16-4c8a-99ba-04b1be67e283
wasAttributedTo {'@id': '7f3da208-4323-4ea0-b1fe-3d9401586be8', 'subject_id': 'es_5'}
isAbout https://raw.githubusercontent.com/sanuann/covid19/master/activity/covid19/items/covid19_clinical_history
value [5, 9, 8]
===================
@context https://raw.githubusercontent.com/ReproNim/reproschema/1.0.0-rc2/contexts/generic
@type reproschema:ResponseActivity
@id uuid:5026fa5b-baef-402c-a131-040169de20f7
used ['https://raw.githubusercontent.com/sanuann/covid19/master/activity/covid19/items/smoking_history', 'https://raw.githubusercontent.com/sanuann/covid19/master/activity/covid19/covid19_schema', 'https://raw.githubusercontent.com/sanuann/covid19/master/protocol/Covid19_schema']
inLanguage es
startedAtTime 2021-02-07T02:43:02.546Z
endedAtTime 2021-02-07T02:43:07.676Z
wasAssociatedWith {'version': '0.0.1', 'url': 'https://sensein.github.io/covid19/', '@id': 'https://github.com/ReproNim/reproschema-ui'}
generated uuid:f239b9ef-b965-4a49-8577-8a681372907e
===================
@context https://raw.githubusercontent.com/ReproNim/reproschema/1.0.0-rc2/contexts/generic
@type reproschema:Response
@id uuid:f239b9ef-b965-4a49-8577-8a681372907e
wasAttributedTo {'@id': '7f3da208-4323-4ea0-b1fe-3d9401586be8', 'subject_id': 'es_5'}
isAbout https://raw.githubusercontent.com/sanuann/covid19/master/activity/covid19/items/smoking_history
value 2
===================
@context https://raw.githubusercontent.com/ReproNim/reproschema/1.0.0-rc2/contexts/generic
@type reproschema:ResponseActivity
@id uuid:e59b0ac5-0baa-44cc-a820-db4be18e2f4a
used ['https://raw.githubusercontent.com/sanuann/covid19/master/activity/covid19/items/covid19_status', 'https://raw.githubusercontent.com/sanuann/covid19/master/activity/covid19/covid19_schema', 'https://raw.githubusercontent.com/sanuann/covid19/master/protocol/Covid19_schema']
inLanguage es
startedAtTime 2021-02-07T02:43:07.676Z
endedAtTime 2021-02-07T02:43:13.250Z
wasAssociatedWith {'version': '0.0.1', 'url': 'https://sensein.github.io/covid19/', '@id': 'https://github.com/ReproNim/reproschema-ui'}
generated uuid:31ac4a12-fbc5-415b-9138-0002d90a0eee
===================
@context https://raw.githubusercontent.com/ReproNim/reproschema/1.0.0-rc2/contexts/generic
@type reproschema:Response
@id uuid:31ac4a12-fbc5-415b-9138-0002d90a0eee
wasAttributedTo {'@id': '7f3da208-4323-4ea0-b1fe-3d9401586be8', 'subject_id': 'es_5'}
isAbout https://raw.githubusercontent.com/sanuann/covid19/master/activity/covid19/items/covid19_status
value 2
===================
@context https://raw.githubusercontent.com/ReproNim/reproschema/1.0.0-rc2/contexts/generic
@type reproschema:ResponseActivity
@id uuid:fc01c8f5-f18d-448f-ab1d-271404f6a5a6
used ['https://raw.githubusercontent.com/sanuann/covid19/master/activity/covid19/items/covid19_status_tested', 'https://raw.githubusercontent.com/sanuann/covid19/master/activity/covid19/covid19_schema', 'https://raw.githubusercontent.com/sanuann/covid19/master/protocol/Covid19_schema']
inLanguage es
startedAtTime 2021-02-07T02:43:13.250Z
endedAtTime 2021-02-07T02:43:18.044Z
wasAssociatedWith {'version': '0.0.1', 'url': 'https://sensein.github.io/covid19/', '@id': 'https://github.com/ReproNim/reproschema-ui'}
generated uuid:78dbd848-7671-4ed7-a05c-a19aee6c398c
===================
@context https://raw.githubusercontent.com/ReproNim/reproschema/1.0.0-rc2/contexts/generic
@type reproschema:Response
@id uuid:78dbd848-7671-4ed7-a05c-a19aee6c398c
wasAttributedTo {'@id': '7f3da208-4323-4ea0-b1fe-3d9401586be8', 'subject_id': 'es_5'}
isAbout https://raw.githubusercontent.com/sanuann/covid19/master/activity/covid19/items/covid19_status_tested
value 1
===================
@context https://raw.githubusercontent.com/ReproNim/reproschema/1.0.0-rc2/contexts/generic
@type reproschema:ResponseActivity
@id uuid:76c9d95b-cb0b-42ef-9fbd-d4b23e885910
used ['https://raw.githubusercontent.com/sanuann/covid19/master/activity/covid19/items/covid19_status_symptoms_positive', 'https://raw.githubusercontent.com/sanuann/covid19/master/activity/covid19/covid19_schema', 'https://raw.githubusercontent.com/sanuann/covid19/master/protocol/Covid19_schema']
inLanguage es
startedAtTime 2021-02-07T02:43:18.044Z
endedAtTime 2021-02-07T02:43:29.812Z
wasAssociatedWith {'version': '0.0.1', 'url': 'https://sensein.github.io/covid19/', '@id': 'https://github.com/ReproNim/reproschema-ui'}
generated uuid:f6105047-7084-4663-952d-eed378fcf3e7
===================
@context https://raw.githubusercontent.com/ReproNim/reproschema/1.0.0-rc2/contexts/generic
@type reproschema:Response
@id uuid:f6105047-7084-4663-952d-eed378fcf3e7
wasAttributedTo {'@id': '7f3da208-4323-4ea0-b1fe-3d9401586be8', 'subject_id': 'es_5'}
isAbout https://raw.githubusercontent.com/sanuann/covid19/master/activity/covid19/items/covid19_status_symptoms_positive
value [1, 2, 3, 6, 9]
===================
@context https://raw.githubusercontent.com/ReproNim/reproschema/1.0.0-rc2/contexts/generic
@type reproschema:ResponseActivity
@id uuid:4922037d-a8cf-48bb-be91-6c4f0a607574
used ['https://raw.githubusercontent.com/sanuann/covid19/master/activity/covid19/items/covid19_days', 'https://raw.githubusercontent.com/sanuann/covid19/master/activity/covid19/covid19_schema', 'https://raw.githubusercontent.com/sanuann/covid19/master/protocol/Covid19_schema']
inLanguage es
startedAtTime 2021-02-07T02:43:29.812Z
endedAtTime 2021-02-07T02:43:40.824Z
wasAssociatedWith {'version': '0.0.1', 'url': 'https://sensein.github.io/covid19/', '@id': 'https://github.com/ReproNim/reproschema-ui'}
generated uuid:095a1bda-6ea9-4503-9752-ba85733db7fd
===================
@context https://raw.githubusercontent.com/ReproNim/reproschema/1.0.0-rc2/contexts/generic
@type reproschema:Response
@id uuid:095a1bda-6ea9-4503-9752-ba85733db7fd
wasAttributedTo {'@id': '7f3da208-4323-4ea0-b1fe-3d9401586be8', 'subject_id': 'es_5'}
isAbout https://raw.githubusercontent.com/sanuann/covid19/master/activity/covid19/items/covid19_days
value 2
===================
@context https://raw.githubusercontent.com/ReproNim/reproschema/1.0.0-rc2/contexts/generic
@type reproschema:ResponseActivity
@id uuid:bf3609f4-e688-4ecb-ab89-9725e0315701
used ['https://raw.githubusercontent.com/sanuann/covid19/master/activity/covid19/items/fever_positive', 'https://raw.githubusercontent.com/sanuann/covid19/master/activity/covid19/covid19_schema', 'https://raw.githubusercontent.com/sanuann/covid19/master/protocol/Covid19_schema']
inLanguage es
startedAtTime 2021-02-07T02:43:40.824Z
endedAtTime 2021-02-07T02:43:56.274Z
wasAssociatedWith {'version': '0.0.1', 'url': 'https://sensein.github.io/covid19/', '@id': 'https://github.com/ReproNim/reproschema-ui'}
generated uuid:85e4c087-c159-4322-94f2-bc1dec4ef64d
===================
@context https://raw.githubusercontent.com/ReproNim/reproschema/1.0.0-rc2/contexts/generic
@type reproschema:Response
@id uuid:85e4c087-c159-4322-94f2-bc1dec4ef64d
wasAttributedTo {'@id': '7f3da208-4323-4ea0-b1fe-3d9401586be8', 'subject_id': 'es_5'}
isAbout https://raw.githubusercontent.com/sanuann/covid19/master/activity/covid19/items/fever_positive
value {'value': '38', 'unitCode': 'Celsius'}
===================
@context https://raw.githubusercontent.com/ReproNim/reproschema/1.0.0-rc2/contexts/generic
@type reproschema:ResponseActivity
@id uuid:7bb35b7f-6b9c-4ad0-9d8e-f92132a9702f
used ['https://raw.githubusercontent.com/sanuann/covid19/master/activity/covid19/items/covid19_status_symptoms_negative', 'https://raw.githubusercontent.com/sanuann/covid19/master/activity/covid19/covid19_schema', 'https://raw.githubusercontent.com/sanuann/covid19/master/protocol/Covid19_schema']
inLanguage es
startedAtTime 2021-02-07T02:43:56.274Z
endedAtTime 2021-02-07T02:44:05.944Z
wasAssociatedWith {'version': '0.0.1', 'url': 'https://sensein.github.io/covid19/', '@id': 'https://github.com/ReproNim/reproschema-ui'}
generated uuid:8420c0f2-b97b-4022-95dc-706818d39fc0
===================
@context https://raw.githubusercontent.com/ReproNim/reproschema/1.0.0-rc2/contexts/generic
@type reproschema:Response
@id uuid:8420c0f2-b97b-4022-95dc-706818d39fc0
wasAttributedTo {'@id': '7f3da208-4323-4ea0-b1fe-3d9401586be8', 'subject_id': 'es_5'}
isAbout https://raw.githubusercontent.com/sanuann/covid19/master/activity/covid19/items/covid19_status_symptoms_negative
value [2, 3, 6, 8, 9]
===================
@context https://raw.githubusercontent.com/ReproNim/reproschema/1.0.0-rc2/contexts/generic
@type reproschema:ResponseActivity
@id uuid:0208db35-d155-4ed6-97f8-e66dcb87aafe
used ['https://raw.githubusercontent.com/sanuann/covid19/master/activity/covid19/items/fever_negative', 'https://raw.githubusercontent.com/sanuann/covid19/master/activity/covid19/covid19_schema', 'https://raw.githubusercontent.com/sanuann/covid19/master/protocol/Covid19_schema']
inLanguage es
startedAtTime 2021-02-07T02:44:05.944Z
endedAtTime 2021-02-07T02:44:15.630Z
wasAssociatedWith {'version': '0.0.1', 'url': 'https://sensein.github.io/covid19/', '@id': 'https://github.com/ReproNim/reproschema-ui'}
generated uuid:306dc527-f9c6-43f7-8135-31e092baf85c
===================
@context https://raw.githubusercontent.com/ReproNim/reproschema/1.0.0-rc2/contexts/generic
@type reproschema:Response
@id uuid:306dc527-f9c6-43f7-8135-31e092baf85c
wasAttributedTo {'@id': '7f3da208-4323-4ea0-b1fe-3d9401586be8', 'subject_id': 'es_5'}
isAbout https://raw.githubusercontent.com/sanuann/covid19/master/activity/covid19/items/fever_negative
value {'value': '35', 'unitCode': 'Celsius'}
===================
@context https://raw.githubusercontent.com/ReproNim/reproschema/1.0.0-rc2/contexts/generic
@type reproschema:ResponseActivity
@id uuid:d3dc8717-04aa-4f41-affe-ce0ac956f490
used ['https://raw.githubusercontent.com/sanuann/covid19/master/activity/covid19/items/covid19_clinical_history', 'https://raw.githubusercontent.com/sanuann/covid19/master/activity/covid19/covid19_schema', 'https://raw.githubusercontent.com/sanuann/covid19/master/protocol/Covid19_schema']
inLanguage es
startedAtTime 2021-02-07T02:44:15.630Z
endedAtTime 2021-02-07T02:44:41.182Z
wasAssociatedWith {'version': '0.0.1', 'url': 'https://sensein.github.io/covid19/', '@id': 'https://github.com/ReproNim/reproschema-ui'}
generated uuid:a5031603-0063-473a-b52a-f9d44ed18dec
===================
@context https://raw.githubusercontent.com/ReproNim/reproschema/1.0.0-rc2/contexts/generic
@type reproschema:Response
@id uuid:a5031603-0063-473a-b52a-f9d44ed18dec
wasAttributedTo {'@id': '7f3da208-4323-4ea0-b1fe-3d9401586be8', 'subject_id': 'es_5'}
isAbout https://raw.githubusercontent.com/sanuann/covid19/master/activity/covid19/items/covid19_clinical_history
value [5, 9, 8]

To build the dataset, I'd take every node whose @type is reproschema:Response, obtain the item name from isAbout, and read its value. That gives something like this:

import json
import os

import pandas as pd

# input_dir (data folder) and n_activities (activity files per submission) are defined elsewhere
files = os.listdir(input_dir)
to_remove = ['.DS_Store', 'files_size.csv', 'list_completed_protocols.py']
submission_dirs = [n for n in files if n not in to_remove]

rows = []
for submission_dir in submission_dirs:
    responses_participant_i = {}

    for activity_N in range(n_activities):
        with open(input_dir + submission_dir + f'/activity_{activity_N}.jsonld', 'r') as f:
            activity = json.load(f)

        for item in activity:
            if item.get('@type') == 'reproschema:Response':
                item_name = item.get('isAbout').split('/')[-1]
                item_name = f'act{activity_N}_{item_name}'  # to specify the activity of this item
                item_response = item.get('value')
                # make everything a string so all types of responses can go into a DF
                responses_participant_i[item_name] = str(item_response)
                # every Response node carries the participant's subject_id
                responses_participant_i['uid'] = item.get('wasAttributedTo').get('subject_id')

    rows.append(responses_participant_i)

# one row per submission (protocol)
df = pd.DataFrame(rows)

(1 row = 1 protocol) [image]
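Casting every value to a string keeps the DataFrame simple, but it hides the structure of list responses such as [5, 9, 8] and unit-bearing responses such as {'value': '38', 'unitCode': 'Celsius'}. As one possible alternative, a small helper could flatten each value into one or more scalar columns. The helper name `flatten_value` and the `;` separator are illustrative choices, not part of reproschema.

```python
def flatten_value(item_name, value):
    """Return a dict of column name -> scalar for a single response value."""
    if isinstance(value, dict):
        # e.g. {'value': '38', 'unitCode': 'Celsius'} -> two columns
        return {f'{item_name}_{key}': val for key, val in value.items()}
    if isinstance(value, list):
        # multi-select answers, e.g. [5, 9, 8] -> '5;9;8'
        return {item_name: ';'.join(str(v) for v in value)}
    # plain scalar answers are kept unchanged
    return {item_name: value}


row = {}
row.update(flatten_value('fever_positive', {'value': '38', 'unitCode': 'Celsius'}))
row.update(flatten_value('covid19_clinical_history', [5, 9, 8]))
row.update(flatten_value('smoking_history', 2))
print(row)
```

Each returned dict can be merged into the per-participant dict with `update`, in place of the `str(item_response)` assignment.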

Other things that could be added are timestamps for each item/activity and the language (which should stay the same, but a participant might switch halfway).
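A minimal sketch of how the timestamps and language could be collected, assuming the layout shown in the printed output: each reproschema:ResponseActivity node points to its reproschema:Response through `generated`, and carries `inLanguage`, `startedAtTime` and `endedAtTime`. The function name `response_metadata` and the `duration_s` key are made up for illustration, and the sample nodes are abbreviated copies of the first two nodes above.

```python
from datetime import datetime

TIME_FORMAT = '%Y-%m-%dT%H:%M:%S.%f%z'  # matches e.g. 2021-02-07T02:42:27.283Z


def response_metadata(activity):
    """Map each response @id to its language, start time and duration in seconds."""
    meta = {}
    for node in activity:
        if node.get('@type') != 'reproschema:ResponseActivity':
            continue
        start = datetime.strptime(node['startedAtTime'], TIME_FORMAT)
        end = datetime.strptime(node['endedAtTime'], TIME_FORMAT)
        # 'generated' holds the @id of the Response this activity produced
        meta[node['generated']] = {
            'inLanguage': node.get('inLanguage'),
            'startedAtTime': node['startedAtTime'],
            'duration_s': (end - start).total_seconds(),
        }
    return meta


# Abbreviated copies of the first two nodes from the output above
activity = [
    {
        '@type': 'reproschema:ResponseActivity',
        '@id': 'uuid:8d84c9d2-d517-4f87-adc7-a28b959dc659',
        'inLanguage': 'es',
        'startedAtTime': '2021-02-07T02:42:27.283Z',
        'endedAtTime': '2021-02-07T02:42:38.013Z',
        'generated': 'uuid:2758c790-002b-4901-83d3-9964bc927bd4',
    },
    {
        '@type': 'reproschema:Response',
        '@id': 'uuid:2758c790-002b-4901-83d3-9964bc927bd4',
        'value': 1983,
    },
]

meta = response_metadata(activity)
for node in activity:
    if node.get('@type') == 'reproschema:Response':
        print(node['value'], meta.get(node['@id']))
```

Looking up `meta[item['@id']]` inside the Response loop would let the language and duration be stored as extra columns next to each item.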

@satra, would it be more useful to use something like rdflib-jsonld to parse the files and obtain the graph?

satra commented 3 years ago

pyld would be the tool to use. there is a function in reproschema-py that will do the "right thing" when reading any jsonld file that is on the filesystem (i.e. data served locally), and rdflib can definitely be used to query the data or convert it to a different form.