stephenslab / gtexresults

Code and data resources accompanying Urbut et al (2017), "Flexible statistical methods for estimating and testing effects in genomic studies with multiple conditions."
https://bit.ly/2FSUxny
MIT License

Output target of step default_1 not found #17

Open aleksicmilica-sbg opened 2 years ago

aleksicmilica-sbg commented 2 years ago

Hi,

I am trying to run the fastqtl_to_mash.ipynb script to convert EMBL eQTL Catalogue data (the BLUEPRINT dataset) to MASHR format. I am getting the following error:

INFO: Running default_1: Convert summary stats gzip format to HDF5
INFO: default_1 (index=0) is completed.
INFO: default_1 (index=1) is completed.
INFO: default_1 (index=2) is completed.
INFO: output:   /opt/gtexresults/fastqtl_to_mash_output/BLUEPRINT.neutrophil.test.tsv.h5 /opt/gtexresults/fastqtl_to_mash_output/BLUEPRINT.tcell.test.tsv.h5... (3 items in 3 groups)
ERROR: [default_1]: [default_1]: Output target /opt/gtexresults/fastqtl_to_mash_output/BLUEPRINT.neutrophil.test.tsv.h5 does not exist after the completion of step default_1
[default]: 3 pending steps: default_2, default_3, default_4

Here are the execution details:

Could you please help me figure out this error?

Thanks in advance!

milica

P.S. The entire documentation is phenomenal. Especially enjoyed reading these pages, it's so detailed and precise. Thank you!

gaow commented 2 years ago

Thanks @aleksicmilica-sbg, just letting you know that I see this issue and I'll update you, hopefully within a week. We have an updated version of this procedure that I need to finalize. I might direct you elsewhere for it, because this repo is meant to reproduce the GTEx analysis, so I'm not going to update it anymore.

aleksicmilica-sbg commented 2 years ago

Thank you @gaow for your quick reply, I am looking forward to seeing the updated procedure! milica

aleksicmilica-sbg commented 2 years ago

Hi @gaow , are there any updates on this? Thanks! :)

gaow commented 2 years ago

Unfortunately it's still a work in progress here: we are retiring the HDF5 format in favor of a VCF format for summary stats, and we are changing the way we compute priors. The timeline is still about 2 weeks from now, as we are short-handed on the data analysis.

Let me help you debug the HDF5 pipeline, though: can you do sos dryrun instead of sos run so you can print the actual command or script being used, then try to run that script directly and see what gives an error? It looks like the error is somewhat "silent", because otherwise SoS would have reported it. Running the script directly should help pinpoint it.

carmacrea commented 1 year ago

any update about the error?

gaow commented 1 year ago

@carmacrea did it work when you ran our minimal working example? We have been using the same workflow logic ourselves without an issue, so the only part I can think of is the HDF5 I/O, but it would be great if you could verify with the minimal working example we provided.

Our new procedure involves performing univariate fine-mapping, saving results in RDS or VCF format, then querying the top signals from those credible sets rather than the top SNP per gene. We have been doing that for our own analysis, although we are still working on a new procedure for generating the mixture model (with @yunqiyang0215 ) before releasing an update. If you have not done fine-mapping, then you are perhaps still better off figuring out what's going on with the HDF5 I/O.
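For intuition, the "top signal per credible set" selection described above might look like the following sketch. The column names ('cs_id', 'variant_id', 'pip') are hypothetical, not the schema of the upcoming release:

```python
def top_per_credible_set(rows):
    # rows: iterable of dicts with hypothetical keys 'cs_id', 'variant_id', 'pip'.
    # Keep the variant with the highest posterior inclusion probability (PIP)
    # within each credible set, rather than one top SNP per gene.
    best = {}
    for r in rows:
        cur = best.get(r["cs_id"])
        if cur is None or r["pip"] > cur["pip"]:
            best[r["cs_id"]] = r
    return best
```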

carmacrea commented 1 year ago

Yes, it worked when I ran the example, but I use a file generated with tensorqtl because fastqtl is no longer maintained, so I don't know if that is the problem. Also my file has a variant_id column instead of gene_id (I only had variant_id).

gaow commented 1 year ago

@carmacrea do you think it is possible for you to modify our minimal working example to reproduce your error and share it here?

carmacrea commented 1 year ago

I am not sure, because in my case the gene_id/phenotype_id is bowelcorrdct, so I don't know how to process it.

carmacrea commented 1 year ago

I tried again and got this error: ERROR: Failed to connect to : ssh: Could not resolve hostname : Bad value for ai_flags

ERROR: [default_1]: [f2663c61ac68ead7]: Failed to connect to : ssh: Could not resolve hostname : Bad value for ai_flags

[default]: 3 pending steps: default_2, default_3, default_4

yangchuhua commented 1 year ago

I am not the root user. I got the following error when I ran my own data, but it worked well when I ran the example data you shared.

fastqtl2mash-docker sos run workflows/fastqtl_to_mash.ipynb \
  --cwd fastqtl_to_mash_output \
  --data_list data/test/test.list \
  --gene_list data/test/test.txt \
  --cols 3 4 5 \
  -j 8 \
  -v 3

DEBUG: R library rhdf5 (2.30.1) is available
INFO: Running default_1: Convert summary stats gzip format to HDF5
DEBUG: _input: data/test/test_1.tsv.gz
DEBUG: Signature mismatch: Missing target  /gtexresults/fastqtl_to_mash_output/test_1.tsv.h5
DEBUG: _input: data/test/test_2.tsv.gz
DEBUG: Signature mismatch: Missing target  /gtexresults/fastqtl_to_mash_output/test_2.tsv.h5
INFO: default_1 (index=1) is completed.
DEBUG: Failed to create signature: output target  /gtexresults/fastqtl_to_mash_output/test_2.tsv.h5 does not exist
DEBUG: Failed to write signature 662dbe61a59ea03a
INFO: default_1 (index=0) is completed.
DEBUG: Failed to create signature: output target  /gtexresults/fastqtl_to_mash_output/test_1.tsv.h5 does not exist
DEBUG: Failed to write signature 3c0121cfe29ddedc
INFO: output:    /gtexresults/fastqtl_to_mash_output/test_1.tsv.h5  /gtexresults/fastqtl_to_mash_output/test_2.tsv.h5 in 2 groups
  File "/opt/conda/lib/python3.7/site-packages/sos/step_executor.py", line 1999, in run
    yreq = runner.send(yres)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/sos/step_executor.py", line 1999, in run
    yreq = runner.send(yres)
  File "/opt/conda/lib/python3.7/site-packages/sos/step_executor.py", line 1878, in run
    self.verify_output()
  File "/opt/conda/lib/python3.7/site-packages/sos/step_executor.py", line 450, in verify_output
    f'Output target {target} does not exist after the completion of step {env.sos_dict["step_name"]}'
RuntimeError: Output target  /gtexresults/fastqtl_to_mash_output/test_1.tsv.h5 does not exist after the completion of step default_1
DEBUG: Step default_1 failed
  File "/opt/conda/lib/python3.7/site-packages/sos/__main__.py", line 552, in cmd_run
    executor.run(args.__targets__, mode=config['run_mode'])
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/sos/__main__.py", line 552, in cmd_run
    executor.run(args.__targets__, mode=config['run_mode'])
  File "/opt/conda/lib/python3.7/site-packages/sos/workflow_executor.py", line 341, in run
    return self.run_as_master(targets=targets, mode=mode)
  File "/opt/conda/lib/python3.7/site-packages/sos/workflow_executor.py", line 1561, in run_as_master
    raise exec_error
sos.executor_utils.ExecuteError: [default_1]: [default_1]: Output target  /gtexresults/fastqtl_to_mash_output/test_1.tsv.h5 does not exist after the completion of step default_1
[default]: 3 pending steps: default_2, default_3, default_4
ERROR: [default_1]: [default_1]: Output target  /gtexresults/fastqtl_to_mash_output/test_1.tsv.h5 does not exist after the completion of step default_1
[default]: 3 pending steps: default_2, default_3, default_4
fastqtl2mash-docker sos run workflows/fastqtl_to_mash.ipynb \
  --data_list data/fastqtl/FastQTLSumStats.list \
  --gene_list data/fastqtl/GTEx_genes.txt \
  -j 8

INFO: Running default_1: Convert summary stats gzip format to HDF5
INFO: default_1 (index=0) is completed.
INFO: default_1 (index=1) is completed.
INFO: output:    /gtexresults/fastqtl_to_mash_output/Tissue_2.fastqtl.h5  /gtexresults/fastqtl_to_mash_output/Tissue_1.fastqtl.h5 in 2 groups
INFO: Running default_2: Merge single study data to multivariate data
INFO: default_2 is completed.
INFO: output:    /gtexresults/fastqtl_to_mash_output/FastQTLSumStats.h5
INFO: Running default_3: Extract data to fit MASH model
INFO: default_3 is completed.
INFO: output:    /gtexresults/fastqtl_to_mash_output/FastQTLSumStats.portable.h5
INFO: Running default_4: Subset and split data, generate Z-score and save to RDS
INFO: default_4 is completed.
INFO: output:    /gtexresults/fastqtl_to_mash_output/FastQTLSumStats.mash.rds
INFO: Workflow default (ID=e27dcc0f542cb7f3) is executed successfully with 4 completed steps and 5 completed substeps.
carmenmacr11 commented 1 year ago

Yes, it has worked for my example. Now what I don't know is how to carry out the MASHR analysis with the given data. I use this command but it doesn't work:

fastqtl2mash-singularity sos run mashr_flashr_workflow.ipynb mash --data ../data/FastQTLSumStats.mash.rds

and it gives me the following error:

INFO: Running vhat_mle: V estimate: "mle" method
INFO: Running pca:
INFO: Running flash_nonneg: Perform FLASH analysis with non-negative factor constraint (time estimate: 20min)
INFO: Running vhat_simple: V estimate: "simple" method (using null z-scores)
INFO: Running flash: Perform FLASH analysis with non-negative factor constraint (time estimate: 20min)
ERROR: flash_nonneg (id=831e2a240d34d36d) returns an error.
ERROR: pca (id=cd2135596b55fb49) returns an error.
ERROR: vhat_simple (id=a6a6ac2bd80c0140) returns an error.
ERROR: flash (id=16103308e4081ad1) returns an error.

yangchuhua commented 1 year ago

Quoting the original report from @aleksicmilica-sbg above, in particular the execution details:

  • Data is in a gzip-compressed, tab-separated txt file with a header containing the following columns: gene_name, snp_id, beta, se, pval. Each tissue file contains 10k SNPs, since at the moment I am testing the workflow.
  • BLUEPRINT.tissues.list contains a list of relative paths to individual tissue files
  • I am running inside a docker container pulled from here
  • I am able to run the example command line with dummy data from the instructions
  • My command line is:
sos run workflows/fastqtl_to_mash.ipynb --data_list blpt-test/BLUEPRINT.tissues.list --cols 4 5 3 --gene-list blpt-test/genes.sorted.uniq.txt -j 1

I was able to resolve the error by making the columns in my data match the example data, in the same order. While the 'fastqtl_to_mash.ipynb' file describes the requirements for 5 columns, matching the example data's column layout also seems to be necessary for the code to run successfully.

Please restructure your input data for 'fastqtl_to_mash.ipynb' to follow these 9 columns:

gene_id
variant_id
tss_distance
ma_samples
ma_count
maf
pval_nominal
slope
slope_se
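As a concrete sketch of that restructuring, assuming a 5-column input like the one described earlier in the thread (gene_name, snp_id, beta, se, pval): columns that cannot be derived from such input are filled with "NA" placeholders here, and whether the converter tolerates NA in those columns is an assumption to verify.

```python
# The 9-column FastQTL-style layout used by the example data.
FASTQTL_COLUMNS = [
    "gene_id", "variant_id", "tss_distance", "ma_samples",
    "ma_count", "maf", "pval_nominal", "slope", "slope_se",
]

# Mapping from the hypothetical 5-column input to the 9-column layout.
RENAME = {"gene_name": "gene_id", "snp_id": "variant_id",
          "pval": "pval_nominal", "beta": "slope", "se": "slope_se"}

def to_fastqtl_row(row):
    """row: dict with gene_name, snp_id, beta, se, pval -> 9-column ordered dict."""
    out = {col: "NA" for col in FASTQTL_COLUMNS}
    for src, dst in RENAME.items():
        out[dst] = row[src]
    return out
```

Rows produced this way can then be written back out as a gzip-compressed, tab-separated file with a header, matching the format the workflow expects.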