htwangtw / adie_ongoingthoughts

ADIE ongoing thought related analysis plan
MIT License

Create generate_participant.py #11

Closed JoelPatchitt closed 3 years ago

JoelPatchitt commented 3 years ago

This branch will resolve issue #10

htwangtw commented 3 years ago

Do you think we can delete generate_participant_splitpath.py now, since we have a good script to adapt into the final product? It won't be goodbye to this file forever. If you ever need it, you can find it in the history.

JoelPatchitt commented 3 years ago

Yes. I'll delete it now.




JoelPatchitt commented 3 years ago

Typo fixed

JoelPatchitt commented 3 years ago

Is this what you are looking for? I always end up getting confused when you ask me to set the path to github!!

I ran the function on my laptop with a cwd() path and it works.

from pathlib import Path
import os
import pandas as pd

data = Path("htwangtw/adie_ongoingthoughts/adie/tests/data/7t_trt/")

def generate_participants():
    subj = list(data.glob("Sub-*")) # Lists directories
    sub_str = [str(e) for e in subj] # Subjects as string, converts elements from windows path to string retaining list format 

    sub_id = []
    while sub_str:  # stop when sub_str is empty
        cur_sub = sub_str.pop()  # pop an item from the list
        cur_sub = cur_sub.split(os.sep) # split string by os specific separator, return a list of strings
        sub_id.append(cur_sub[-1]) # save output

    # Convert list into a DataFrame
    df = pd.DataFrame(sub_id, columns=['participant_id'])
    print(df)

    df.to_csv('participants.tsv', sep='\t', index=False) # Output to .tsv file

htwangtw commented 3 years ago

Sorry I didn't get back to you for an epic weekend challenge as promised.

Here's a small one: can you modify this function to accept different BIDS datasets and produce the same kind of file?

generate_participants("/path/to/dataset 1")
generate_participants("/path/to/dataset 2")

Both calls should produce participants.tsv in the BIDS directory.
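The request above could be sketched as follows. This is only an illustration, not the script that was merged: it assumes subject directories follow the lowercase BIDS `sub-<label>` naming (the function in this thread globs for `Sub-*`), and it writes the file with the standard-library `csv` module instead of pandas to stay dependency-free.

```python
import csv
from pathlib import Path


def generate_participants(bids_dir):
    """Write participants.tsv listing every sub-* directory in bids_dir."""
    bids_dir = Path(bids_dir)  # accept a plain string or a Path
    # Path.name is the last path component, so no manual string splitting is needed.
    # Assumption: subject folders are named sub-<label>, as in the BIDS specification.
    subjects = sorted(p.name for p in bids_dir.glob("sub-*") if p.is_dir())

    out_file = bids_dir / "participants.tsv"  # saved inside the dataset itself
    with open(out_file, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["participant_id"])
        writer.writerows([s] for s in subjects)
    return out_file
```

Calling `generate_participants("/path/to/dataset 1")` and then `generate_participants("/path/to/dataset 2")` would leave one participants.tsv in each dataset, because the output path is built from the argument rather than from the current working directory.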

JoelPatchitt commented 3 years ago

When you say different BIDS dataset, do you mean a .xlsx file or something else?

htwangtw commented 3 years ago

If you have a look at the test data: https://github.com/htwangtw/adie_ongoingthoughts/tree/main/adie/tests/data

You can find 4 directories; each of them is a dataset in BIDS format (hence "BIDS dataset"). I would like you to modify your code and test it on these three:

Does this explanation help?

JoelPatchitt commented 3 years ago

Yes, this helps. I am struggling to get those files downloaded onto my computer. Those directories exist in your repository but not mine. How do I pull them over to mine so that I can run the tests on my computer?

JoelPatchitt commented 3 years ago

Python is telling me I cannot use quotations as a function's argument. I'm pretty lost here.
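One possible cause of that error, assuming the quoted path was typed into the `def` line itself: Python expects a parameter name in the definition, and the quoted string is supplied only when the function is called. A minimal sketch with a hypothetical helper name:

```python
from pathlib import Path

# Not allowed: a string literal cannot stand in for a parameter name.
# def dataset_name("/path/to/dataset 1"):   # raises SyntaxError


def dataset_name(bids_dir):
    """Hypothetical helper: return the last component of the dataset path."""
    return Path(bids_dir).name


# The quoted path belongs in the call, not in the definition.
print(dataset_name("/path/to/dataset 1"))  # dataset 1
```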

JoelPatchitt commented 3 years ago

Hey Hao-Ting,

I am now connected to the cisc volumes and analysis server. Please see below the function you asked for. Let me know if it's wrong! I await further orders!


def generate_participants(datafile):
    subj = list(datafile.glob("Sub-*")) # Lists directories
    sub_str = [str(e) for e in subj] # Subjects as string, converts elements from windows path to string retaining list format 

    sub_id = []
    while sub_str:  # stop when sub_str is empty
        cur_sub = sub_str.pop()  # pop an item from the list
        cur_sub = cur_sub.split(os.sep) # split string by os specific separator, return a list of strings
        sub_id.append(cur_sub[-1]) # save output

    # Convert list into a DataFrame
    df = pd.DataFrame(sub_id, columns=['participant_id'])

JoelPatchitt commented 3 years ago

Necessary imports:


import pandas as pd
import os
from pathlib import Path

path = Path("path/to/your/data")

htwangtw commented 3 years ago

Update your script in bin and I will review it. There's one typo, and I think it will soon be ready to ship!

JoelPatchitt commented 3 years ago

Hey @htwangtw ,

I have been running a little side project that I have managed to finish and that I think you might be interested in.

Please see below a function that grabs the subject numbers from the directory names, as per our current script, but also enters each folder and extracts .xlsx data (I created a fake Excel datasheet with age, gender and handedness for each participant).

I have tried to keep the function general so that it will work on all operating systems & filepaths.

I know that this might not be useful to the current study, or maybe it will, what do you think? Are there any corrections that could be made?

# Imported functions
import pandas as pd
from pathlib import Path
import glob
import os

# File paths
path = Path.cwd() # Can be modified to directory
sub_dir = os.path.join("Sub_*", "*_datasheet.xlsx") # OS-specific separator; can be modified to suit sub-directories

# Data extraction function
def generate_participants(datafile):
    data_loc = os.path.join(path, sub_dir)
    subj = list(datafile.glob("Sub_*")) # Lists directories
    sub_str = [str(e) for e in subj] # Subjects as string, converts elements from windows path to string retaining list format 
    subjdata = glob.glob(data_loc, recursive=True)
    sub_id = []
    df = pd.DataFrame()

# Extract subject number from directory name
    while sub_str:  # Stop when sub_str is empty
        cur_sub = sub_str.pop()  # Pop an item from the list
        cur_sub = cur_sub.split(os.sep) # Split string by os specific separator, return a list of strings
        sub_id.append(cur_sub[-1]) # Save output

    sub_id = pd.DataFrame(sub_id, columns=['participant_id']) # Create dataframe
    sub_id["participant_id"] = sub_id["participant_id"].values[::-1] # Subject ID's flipped for some reason

# Extract .xlsx datafile from subject's directory
    for file in subjdata:
        if file.endswith('.xlsx'):
            df = df.append(pd.read_excel(file), ignore_index=True)

# Concatenate dataframes
    df2 = pd.concat([sub_id, df], axis=1)
# Convert to .tsv file
    df2.to_csv('participants.tsv', sep='\t', index=False) # Output to .tsv file

Here is the output:

participant_id  age  gender  handedness
Sub_01          14   m       r
Sub_02          36   m       r
Sub_03          26   f       l
Sub_04          18   f       l
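On the flipped subject IDs mentioned in the code comment: `list.pop()` removes items from the end of the list, so the `while` loop collects the subjects in reverse. A minimal sketch, using made-up directory names, of that effect and of `Path.name` as an alternative that keeps the original order:

```python
from pathlib import PurePosixPath

# Made-up subject directories, already in sorted order
subj = [
    PurePosixPath("data/Sub_01"),
    PurePosixPath("data/Sub_02"),
    PurePosixPath("data/Sub_03"),
]

# The loop from the thread: pop() takes the last item first, reversing the order
sub_str = [str(p) for p in subj]
popped = []
while sub_str:
    popped.append(sub_str.pop().split("/")[-1])

# Path.name gives the final path component directly and preserves the order
names = [p.name for p in subj]

print(popped)  # ['Sub_03', 'Sub_02', 'Sub_01']
print(names)   # ['Sub_01', 'Sub_02', 'Sub_03']
```

With real directories, wrapping the glob result in `sorted()` would also make the order explicit, since glob does not promise one.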

htwangtw commented 3 years ago

I like your attempt to address other variables that can be included in participants.tsv. A lot of that demographic information can be found in the assessment data provided by Lisa. Do you want to open a separate issue on that?

JoelPatchitt commented 3 years ago

Sure, just tell me what needs to be done and I will give it a shot.