ncbi / gaptools

dbGaP data validation tool repo

Can gaptools be run on only the subject phenotype files? #16

Open DanielEWeeks opened 1 year ago

DanielEWeeks commented 1 year ago

The gaptools documentation mentions only two required files: the subject consent and the subject sample map files.

So I was wondering if it could be run on input that contains those two plus the subject phenotype files without providing any genotype data? Or is it required that genotype data be present for a run of gaptools?

For example, if I run it on a made-up test data set with the JSON below, it doesn't do anything beyond generating a metrics-CHECK_METADATA_FILES.json file containing the following:

[
    {
        "dag_id": "GapTools",
        "run_id": "2023-02-08-13-57-32",
        "dag_task": "CHECK_METADATA_FILES",
        "metric_name": "dag_run_start_time",
        "metric_value": "2023-02-08 13:57:45",
        "metric_object": ""
    }
]

Input metadata JSON file:

{
  "NAME": "Example 1",
  "FILES": [
    {
      "name": "ExampleConsent_DS.txt",
      "type": "subject_consent_file"
    },
    {
      "name": "ExampleConsent_DD.xlsx",
      "type": "subject_consent_data_dictionary_file"
    },
    {
      "name": "ExampleSSM_DS.txt",
      "type": "subject_sample_mapping_file"
    },
    {
      "name": "ExampleSSM_DD.xlsx",
      "type": "subject_sample_mapping_data_dictionary_file"
    },
    {
      "name": "DS_Example.txt",
      "type": "phenotype_ds"
    },
    {
      "name": "3b_SSM_DD_Example1.xlsx",
      "type": "phenotype_dd"
    }
  ]
}
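Before a run, it can help to confirm that the metadata lists the two file types the documentation is said to require. The following is a minimal sketch, not part of gaptools: the function name `missing_required_types` is hypothetical, and the type strings are copied from the example above. Whether the data dictionary entries are also required is not stated in this thread, so the sketch does not check them.

```python
import json

# Type names copied from the example metadata above. The documentation is
# said to require only the subject consent and subject sample mapping files;
# treating exactly these two as required is an assumption of this sketch.
REQUIRED_TYPES = {"subject_consent_file", "subject_sample_mapping_file"}


def missing_required_types(metadata: dict) -> set:
    """Return the required file types that the metadata does not list."""
    listed = {entry.get("type") for entry in metadata.get("FILES", [])}
    return REQUIRED_TYPES - listed


# Usage: a metadata document that lists only phenotype files.
example = json.loads("""
{
  "NAME": "Example 1",
  "FILES": [
    {"name": "DS_Example.txt", "type": "phenotype_ds"},
    {"name": "3b_SSM_DD_Example1.xlsx", "type": "phenotype_dd"}
  ]
}
""")
print(sorted(missing_required_types(example)))
```

An empty result means both required types are listed; it says nothing about whether the files themselves exist or are well formed.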

mfeolo commented 1 year ago

Hi Daniel, Thank you for your feedback. You can run the tool with just the phenotype and id data files. Regards, Mike

Michael Feolo Staff Scientist, dbGaP Team Lead National Center for Biotechnology Information Building 45, 4AN.12B, MSC 6514 Bethesda, MD 20894 Phone: 301.402.2874 Email: @.***

DanielEWeeks commented 1 year ago

Dear Mike,

Thank you for your reply. While I got gaptools to run just fine on the provided 1000 Genomes example file, it is not running on this made-up example data (link below) and I have no idea why. All it produces in the 'metrics' folder is:

$ more metrics-CHECK_METADATA_FILES.json 
[
    {
        "dag_id": "GapTools",
        "run_id": "2023-02-08-16-23-10",
        "dag_task": "CHECK_METADATA_FILES",
        "metric_name": "dag_run_start_time",
        "metric_value": "2023-02-08 16:23:23",
        "metric_object": ""
    }
]

I tried it with different variations of the JSON: first with just the phenotype files, then with the phenotype, consent, and subject sample mapping files, and now with more. All of those variations generate only the metrics-CHECK_METADATA_FILES.json file shown above.

Thanks, Dan

Link to made-up example data:

https://pitt-my.sharepoint.com/:u:/g/personal/weeks_pitt_edu/EchfcjQJEEZCsV5RtZMIjIIBCv0eJE40OXeQqcr9lQ7J3w?e=ALn8Gf

DanielEWeeks commented 1 year ago

I found my mistake:

I didn't notice this line in the instructions:

"The file has to be named metadata.json"

so it was failing because I had named it 'Example1.json' instead.
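A small pre-flight check on the path passed to -m would catch this mistake before a run. This is a minimal sketch, not part of gaptools: the function name `check_metadata_path` is hypothetical, and the only rule it encodes is the one quoted above, that the file has to be named metadata.json.

```python
from pathlib import Path

# File name quoted from the gaptools instructions.
REQUIRED_NAME = "metadata.json"


def check_metadata_path(path_str: str) -> list:
    """Return a list of problems with the path given to -m (empty if none)."""
    problems = []
    path = Path(path_str)
    # The instructions fix the file name; only the directory may vary.
    if path.name != REQUIRED_NAME:
        problems.append(f"file is named {path.name!r}, expected {REQUIRED_NAME!r}")
    if not path.is_file():
        problems.append(f"{path} does not exist or is not a file")
    return problems


# Usage: report any problems for the path from the example command.
for problem in check_metadata_path("./input_files/1000_Genomes_Study/metadata.json"):
    print(problem)
```

Returning a list rather than raising on the first problem lets a wrong name and a missing file be reported together.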

Suggestion:

I made this mistake in part because I assumed this option in the example command:

-m ./input_files/1000_Genomes_Study/metadata.json

would allow me to change the name of the JSON metadata file.

Since the -m option is only meant to let the user change the path to the metadata.json file, not the name of the file itself, it would be clearer if the option took only the directory, shortening it to:

-m ./input_files/1000_Genomes_Study/

mfeolo commented 1 year ago

Hi Daniel, Thanks for following up. You beat us to it. We will take your feedback into account to make the documentation better. Regards, Mike
