missing pickle file

hschult commented 3 years ago

Call: python bin/tf_analyzer.py -g mm9 -b liver

Error:

76 of 152files were successfully normalised. If not all files were normalised, check logging for further information.
Traceback (most recent call last):
  File "bin/tf_analyzer.py", line 237, in <module>
    main()
  File "bin/tf_analyzer.py", line 211, in main
    scores, exist = scripts.score.findarea(args.width, args.genome.lower(), [x.lower() for x in args.biosource],
  File "/mnt/workspace/hschult/jlu-bda-2020/bin/scripts/score.py", line 33, in findarea
    atacdict = pickle.load(open(os.path.join(picklepath, genom, 'atac-seq', biosource + ".pickle"), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/workspace/hschult/jlu-bda-2020/data/pickledata/mm9/atac-seq/liver.pickle'

There are no pickle files in the folder given in the error message.

fctsfrmspc commented 3 years ago

generate_data.py skips the validation, sorting and pickle creation steps when no new data was downloaded. So if an error occurred during one of those steps in a previous run, the pipeline will skip them again on every later run and keep failing. @Jasmin-Walter addressed this in 15db5ef472973f291f23ef09984589b7370a9c2e by adding a --redo_file_validation argument, but it still needs to be implemented in generate_data.py.
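To illustrate the skip behaviour described above, here is a minimal sketch of how such a decision could look. It is not the actual code from generate_data.py: the function name `should_validate` and the parser setup are hypothetical, and only the --redo_file_validation flag name is taken from this thread.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing only the flag discussed in this issue."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--redo_file_validation",
        action="store_true",
        help="run validation, sorting and pickle creation even if nothing new was downloaded",
    )
    return parser


def should_validate(new_files_downloaded: int, redo_file_validation: bool) -> bool:
    """Return True when validation, sorting and pickle creation should run.

    Assumption: the steps run if new files arrived OR the user forces a redo.
    Without the second condition, a failed earlier run is never repaired.
    """
    return new_files_downloaded > 0 or redo_file_validation


if __name__ == "__main__":
    args = build_parser().parse_args()
    # 0 stands in for "No new files to download."
    print(should_validate(0, args.redo_file_validation))
```

With a check like this, a run that downloads nothing still repeats the skipped steps when the flag is given, which is what allows recovery from a failed earlier run.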

rwiegan commented 3 years ago

The problem still persists. I tried running with the --redo_analysis argument:

python bin/tf_analyzer.py -g mm9 -b liver -t gata4 polr2a -c chr1 --redo_analysis
fetching 20 ATAC/DNAse-seq experiments ...
fetching 6 ChIP-seq experiments ...
kept 20 ATAC/DNAse-seq experiments
0 lines added to /mnt/workspace/rwiegan/git/jlu-bda-2020/data/download/linking_table.csv
creating queue ...
No new files to download.
No new data was downloaded, skipping validation, merging and sorting.
Reading in linkage table.
Now starting normalisation process. 0 files will be normalised.
------ Log scaling files ------
------ Finding global min/max values ------
------ Min-max scaling all files -------
Warning: No files were normalised. Please check logging for further information.
Traceback (most recent call last):
  File "bin/tf_analyzer.py", line 261, in <module>
    main()
  File "bin/tf_analyzer.py", line 234, in main
    scores, exist = scripts.score.findarea(args.width, args.genome.lower(), [x.lower() for x in args.biosource],
  File "/mnt/workspace/rwiegan/git/jlu-bda-2020/bin/scripts/score.py", line 33, in findarea
    atacdict = pickle.load(open(os.path.join(picklepath, genom, 'atac-seq', biosource + ".pickle"), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/workspace/rwiegan/git/jlu-bda-2020/data/pickledata/mm9/atac-seq/liver.pickle'

I also tried the --check_local_files parameter.

python bin/tf_analyzer.py -g mm9 -b liver -t gata4 polr2a -c chr1 --check_local_files /mnt/workspace/rwiegan/git/jlu-bda-2020/data/
fetching 20 ATAC/DNAse-seq experiments ...
fetching 6 ChIP-seq experiments ...
kept 20 ATAC/DNAse-seq experiments
0 lines added to /mnt/workspace/rwiegan/git/jlu-bda-2020/data/download/linking_table.csv
creating queue ...
No new files to download.
No new data was downloaded, skipping validation, merging and sorting.
Reading in linkage table.
Now starting normalisation process. 0 files will be normalised.
------ Log scaling files ------
------ Finding global min/max values ------
------ Min-max scaling all files -------
Warning: No files were normalised. Please check logging for further information.
Traceback (most recent call last):
  File "bin/tf_analyzer.py", line 261, in <module>
    main()
  File "bin/tf_analyzer.py", line 234, in main
    scores, exist = scripts.score.findarea(args.width, args.genome.lower(), [x.lower() for x in args.biosource],
  File "/mnt/workspace/rwiegan/git/jlu-bda-2020/bin/scripts/score.py", line 33, in findarea
    atacdict = pickle.load(open(os.path.join(picklepath, genom, 'atac-seq', biosource + ".pickle"), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/workspace/rwiegan/git/jlu-bda-2020/data/pickledata/mm9/atac-seq/liver.pickle'

fctsfrmspc commented 3 years ago

--redo_file_validation is the correct argument to use here, as I mentioned before. You may also need to update your conda environment with the latest environment.yml, since we added a different version of rename with the util-linux dependency.

rwiegan commented 3 years ago

Ah, I missed this parameter. With --redo_file_validation it worked.
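Since the bare FileNotFoundError did not point to this parameter, the pickle loading step could check for the file first and say what to try next. The following is only a sketch: `load_atac_pickle` is a hypothetical helper, not the project's `findarea` function, and it assumes the directory layout visible in the traceback (pickledata/GENOME/atac-seq/BIOSOURCE.pickle).

```python
import os
import pickle


def load_atac_pickle(picklepath: str, genome: str, biosource: str):
    """Load the ATAC-seq pickle for one biosource, with a more helpful error.

    Hypothetical helper; the path layout mirrors the one in the traceback.
    """
    path = os.path.join(picklepath, genome, "atac-seq", biosource + ".pickle")
    if not os.path.isfile(path):
        # Same exception type as before, but with a hint on how to recover.
        raise FileNotFoundError(
            f"Expected pickle file not found: {path}. "
            "It may not have been created in an earlier run; "
            "try rerunning with --redo_file_validation."
        )
    # The context manager closes the file handle after loading.
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Using a `with` block also closes the file handle, which the one-line `pickle.load(open(...))` call in the traceback leaves open.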

hschult commented 3 years ago

Ok, the error seems to be fixed now. However, there are still some warnings and errors you may want to look at. I have prepared a summary below.

No new files to download.
validating files
Error - overlapping regions in bedGraph line 8 of /mnt/workspace/hschult/jlu-bda-2020/data/temp/ENCFF001YAM.chr5.bedgraph
Error - overlapping regions in bedGraph line 53 of /mnt/workspace/hschult/jlu-bda-2020/data/temp/ENCFF001YAM.chr7.bedgraph
Error - overlapping regions in bedGraph line 17 of /mnt/workspace/hschult/jlu-bda-2020/data/temp/ENCFF001YAM.chr10.bedgraph
Error - overlapping regions in bedGraph line 14 of /mnt/workspace/hschult/jlu-bda-2020/data/temp/ENCFF001YAM.chr11.bedgraph
Error - overlapping regions in bedGraph line 5 of /mnt/workspace/hschult/jlu-bda-2020/data/temp/ENCFF001YAM.chr13.bedgraph
...
unrecognized file format
unexpected file: ENCFF002ADS.chr1 chr1
unrecognized file format
unexpected file: ENCFF002ADS.chr2 chr2
unrecognized file format
unexpected file: ENCFF002ADS.chr3 chr3
unrecognized file format
unexpected file: ENCFF002ADS.chr4 chr4
unrecognized file format
unexpected file: ENCFF002ADS.chr5 chr5
unrecognized file format
unexpected file: ENCFF002ADS.chr6 chr6
unrecognized file format
unexpected file: ENCFF002ADS.chr7 chr7
unrecognized file format
....
sorting files
{'ctcf', 'polr2a', 'gata4'}
{'liver'}
Reading in linkage table.
Now starting normalisation process. 456 files will be normalised.
------ Log scaling files ------
- Log-scaling file 1 of 456: /mnt/workspace/hschult/jlu-bda-2020/data/mm9/liver/chip-seq/ctcf/ENCFF001YAM.chr1.bw
- Log-scaling file 2 of 456: /mnt/workspace/hschult/jlu-bda-2020/data/mm9/liver/chip-seq/ctcf/ENCFF001YAM.chr2.bw
- Log-scaling file 3 of 456: /mnt/workspace/hschult/jlu-bda-2020/data/mm9/liver/chip-seq/ctcf/ENCFF001YAM.chr3.bw
- Log-scaling file 4 of 456: /mnt/workspace/hschult/jlu-bda-2020/data/mm9/liver/chip-seq/ctcf/ENCFF001YAM.chr4.bw
- Log-scaling file 5 of 456: /mnt/workspace/hschult/jlu-bda-2020/data/mm9/liver/chip-seq/ctcf/ENCFF001YAM.chr5.bw
[bwHdrRead] There was an error while reading in the header!
[pyBwOpen] bw is NULL!
- File /mnt/workspace/hschult/jlu-bda-2020/data/mm9/liver/chip-seq/ctcf/ENCFF001YAM.chr5.bw could not be normalized. Please check logging for further info.
- Log-scaling file 6 of 456: /mnt/workspace/hschult/jlu-bda-2020/data/mm9/liver/chip-seq/ctcf/ENCFF001YAM.chr6.bw
- Log-scaling file 7 of 456: /mnt/workspace/hschult/jlu-bda-2020/data/mm9/liver/chip-seq/ctcf/ENCFF001YAM.chr7.bw
[bwHdrRead] There was an error while reading in the header!
[pyBwOpen] bw is NULL!
- File /mnt/workspace/hschult/jlu-bda-2020/data/mm9/liver/chip-seq/ctcf/ENCFF001YAM.chr7.bw could not be normalized. Please check logging for further info.
...
418 of 456files were successfully normalised. If not all files were normalised, check logging for further information.
[bwHdrRead] There was an error while reading in the header!
[pyBwOpen] bw is NULL!
Unable to open file /mnt/workspace/hschult/jlu-bda-2020/data/mm9/liver/chip-seq/ctcf/ENCFF001YAM.chr5.bw
[bwHdrRead] There was an error while reading in the header!
[pyBwOpen] bw is NULL!
Unable to open file /mnt/workspace/hschult/jlu-bda-2020/data/mm9/liver/chip-seq/ctcf/ENCFF001YAM.chr7.bw
[bwHdrRead] There was an error while reading in the header!
[pyBwOpen] bw is NULL!
Unable to open file /mnt/workspace/hschult/jlu-bda-2020/data/mm9/liver/chip-seq/ctcf/ENCFF001YAM.chr10.bw
[bwHdrRead] There was an error while reading in the header!
[pyBwOpen] bw is NULL!
Unable to open file /mnt/workspace/hschult/jlu-bda-2020/data/mm9/liver/chip-seq/ctcf/ENCFF001YAM.chr11.bw
[bwHdrRead] There was an error while reading in the header!
[pyBwOpen] bw is NULL!
...

07_04_2021_12_00_47_generate_data.log

JonnyCodewalker commented 3 years ago

This issue has been fixed. The cause was that wrongly formatted files were able to pass through the pipeline.
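For reference, the "overlapping regions" errors in the log above are the kind of formatting problem that can be caught before a file moves on. This is a minimal sketch and not the validation the pipeline actually uses: `find_overlaps` is a hypothetical function, and it assumes the bedGraph lines are sorted by start position within each chromosome.

```python
from typing import Iterable, List


def find_overlaps(lines: Iterable[str]) -> List[int]:
    """Return 1-based line numbers whose interval overlaps an earlier one.

    Assumes bedGraph columns (chrom, start, end, value) with half-open
    intervals, so start == previous end counts as adjacent, not overlapping.
    """
    bad: List[int] = []
    prev_chrom = None
    prev_end = 0
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        # Skip blank lines, comments and track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if chrom == prev_chrom:
            if start < prev_end:
                bad.append(number)
            prev_end = max(prev_end, end)
        else:
            prev_chrom, prev_end = chrom, end
    return bad


if __name__ == "__main__":
    example = [
        "chr5\t0\t10\t1.0",
        "chr5\t5\t20\t2.0",   # starts before the previous interval ends
        "chr5\t20\t30\t0.5",  # adjacent, not overlapping
        "chr7\t0\t5\t1.0",
    ]
    print(find_overlaps(example))
```

A file for which this returns a non-empty list could be rejected or logged before conversion, rather than surfacing later as an unreadable .bw file.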

loosolab / jlu-bda-2020

missing pickle file #40