htwangtw / adie_ongoingthoughts

ADIE ongoing thought related analysis plan
MIT License

Create phenotype_parse.py #19

Open JoelPatchitt opened 3 years ago

JoelPatchitt commented 3 years ago

Hey Hao-Ting,

There was an issue with the previous script where the session label 'F' shared the letter F with column names elsewhere in the data sheet. This caused some baseline data to seep into the follow-up data. I know there is a better way to fix this than the solution I have introduced as 'parse_phenotype' (probably using the re module), but I have tried and tried and ended up throwing together this ham-fisted solution that did the trick.

Please see the new script, named parse_phenotype.
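
A minimal sketch of the kind of collision described above. It assumes the columns are named with the session label followed by an underscore; the column names here are hypothetical, not taken from the real data sheet.

```python
# Hypothetical column names: session label, underscore, measurement name.
columns = ["BL_stuff", "F_stuff", "BL_Fscale", "FY_stuff"]

# Substring test: any column containing a capital "F" is picked up,
# including a baseline column and a one-year column.
loose = [col for col in columns if "F" in col]

# Prefix test: only columns that start with the session label and underscore.
strict = [col for col in columns if col.startswith("F_")]

print(loose)
print(strict)
```

With these names, the substring test returns three columns while the prefix test returns only the 'F' session column.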

htwangtw commented 3 years ago

I had a look at the phenotype data you organised. It is awesome! Really well done! :tada:

Two minor points:

  1. The session labels do not match the ones we use for BIDS behavioural data. The problem with the current labels is that their meanings are a bit vague, and when people use the session name in SPSS they cannot consistently prepend it to the measurement name, because SPSS doesn't allow variable names to start with numbers. Here's a Python dictionary that translates the session names in the source data into more consistent ones:
    sessions = {
        "BL": "baseline",
        "F": "oneweek",  # confirmed by lisa
        "3mf": "threemonth",
        "FY": "oneyear",
    }
  2. We can remove the session information from the header of the TSV, as it's already specified in the file name.
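
A rough sketch of how the two points could fit together, assuming each source column is named as the session label, an underscore, then the measurement. The helper `split_session` and the column names are made up for illustration and are not part of the script under review.

```python
sessions = {
    "BL": "baseline",
    "F": "oneweek",
    "3mf": "threemonth",
    "FY": "oneyear",
}


def split_session(column):
    """Split 'BL_stuff' into ('baseline', 'stuff').

    Returns (None, column) when the part before the first underscore
    is not a known session label.
    """
    prefix, sep, measure = column.partition("_")
    if sep and prefix in sessions:
        return sessions[prefix], measure
    return None, column


# Hypothetical header from the source data sheet.
columns = ["BL_stuff", "F_stuff", "3mf_stuff", "FY_stuff", "age"]

# Group measurement names by the translated session name, so the session
# goes into the file name and the header keeps only the measurement.
by_session = {}
for col in columns:
    session, measure = split_session(col)
    if session is not None:
        by_session.setdefault(session, []).append(measure)

print(by_session)
```

Because the whole prefix is looked up in the dictionary, "F" and "FY" are never confused with each other.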

Again really well done!

JoelPatchitt commented 3 years ago

Thank you for the feedback, I will create some comments that help with the script.

The only issue I am having is with the column names that I have purposely left unquoted on lines 41, 43 and 45 (please see the replies to your comments above).

They are too ambiguous or appear in other column names, so those other columns get picked up when the new dataframe is put together. Is there a way to resolve this ambiguity (possibly using regex)?

htwangtw commented 3 years ago

First of all, if the meaning is ambiguous, regex won't help. I am not sure why you are so fixated on the idea of using regex.

All you need to do is something like this:

my_list = ["F_stuff", "BL_nostuff", "BL_stuff", "BL_stuffnotthisone"]

collect_stuff = [item for item in my_list if item[-6:] == "_stuff"]  # match the last 6 characters

Regex doesn't make it faster, since you still need to iterate through the list. For a simple pattern like this, regex is overkill; you won't really encounter cases that need it unless you work with more complex data. Here's the same solution in regex, which is less readable than the original unless you are used to reading regex. At the current stage, regex is not worth spending your time on.

import re

my_list = ["F_stuff", "BL_nostuff", "BL_stuff", "BL_stuffnotthisone"]

collect_stuff = [item for item in my_list if re.search(r"_stuff$", item) is not None]
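
For comparison, a small sketch that puts the slice and the regex side by side with the built-in `str.endswith` method, which was not used above but avoids counting characters by hand. The variable names are only for illustration.

```python
import re

my_list = ["F_stuff", "BL_nostuff", "BL_stuff", "BL_stuffnotthisone"]

# Slice comparison: the last 6 characters must equal "_stuff".
by_slice = [item for item in my_list if item[-6:] == "_stuff"]

# Built-in suffix check: no need to work out the suffix length.
by_endswith = [item for item in my_list if item.endswith("_stuff")]

# Regex anchored at the end of the string.
by_regex = [item for item in my_list if re.search(r"_stuff$", item) is not None]

print(by_slice)
print(by_slice == by_endswith == by_regex)
```

All three select the same two items on this list, so the choice is about readability, not results.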