sjteresi / TE_Density

Python script calculating transposable element density for all genes in a genome. Publication: https://mobilednajournal.biomedcentral.com/articles/10.1186/s13100-022-00264-4
GNU General Public License v3.0

OSError: Unable to open file #130

Closed zmz1988 closed 1 year ago

zmz1988 commented 1 year ago

Hello, thanks a lot for providing this nice package! I recently tried it but got the error below, and I can't figure out what went wrong. If you could point me to where I should check, I would really appreciate it!

My input files were generated with your helper scripts import_Arabidopsis_gene_anno.py and import_Arabidopsis_EDTA.py, since my TE file and gene annotation file match the formats those helper scripts parse.

Then I used the suggested command python /home/TE_Density/process_genome.py Cleaned_Co_new_annotation_clean1_cleanTag.tsv Cleaned_Co.fasta.mod.EDTA.TEanno.gff3.tsv Co_genome -c /home/TE_Density/config/production_run_config.ini -n 5 -o Co_TE_density to perform the TE density analysis. However, I got errors as below:

(Some of the screen output is omitted. The output below was generated by rerunning with --reset_h5 after the first failed attempt.)

2023-03-16 18:04:55 baddiel.local PreProcessor[31967] INFO overwrite: /multi_omics_analysis/TE_density/filtered_input_data/input_h5_cache/Co_genome_Chr1_RagTag_polished_TEData.h5
2023-03-16 18:04:55 baddiel.local __main__[31967] INFO preprocessed 7 files to TE_density/filtered_input_data/input_h5_cache
2023-03-16 18:04:55 baddiel.local __main__[31967] INFO preprocessing... complete
2023-03-16 18:04:55 baddiel.local __main__[31967] INFO process overlap...
2023-03-16 18:04:55 baddiel.local OverlapManager[31967] INFO output overlap data to /multi_omics_analysis/TE_density/tmp/overlap
process     : 0it [00:00, ?it/s]
genes       : 0it [00:00, ?it/s]
2023-03-16 18:04:56 baddiel.local __main__[31967] INFO processed 7 overlap jobs
2023-03-16 18:04:56 baddiel.local __main__[31967] INFO process overlap... complete
2023-03-16 18:04:56 baddiel.local __main__[31967] INFO process density
Traceback (most recent call last):
  File "/home/TE_Density/transposon/overlap.py", line 318, in _open_existing_file
    self._h5_file = h5py.File(cfg.filepath, "r")
  File "/miniconda3/envs/TE_density/lib/python3.10/site-packages/h5py/_hl/files.py", line 533, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
  File "/miniconda3/envs/TE_density/lib/python3.10/site-packages/h5py/_hl/files.py", line 226, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 106, in h5py.h5f.open
OSError: Unable to open file (truncated file: eof = 96, sblock->base_addr = 0, stored_eof = 2048)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/TE_Density/process_genome.py", line 313, in <module>
    n_subsets = sum(calc_merge_number_operations(job) for job in jobs)
  File "/home/TE_Density/process_genome.py", line 313, in <genexpr>
    n_subsets = sum(calc_merge_number_operations(job) for job in jobs)
  File "/home/TE_Density/process_genome.py", line 99, in calc_merge_number_operations
    with overlap_data as overlap_input:
  File "/home/TE_Density/transposon/overlap.py", line 252, in __enter__
    self.start()
  File "/home/TE_Density/transposon/overlap.py", line 232, in start
    self._open_dispatcher()
  File "/home/TE_Density/transposon/overlap.py", line 306, in _open_dispatcher
    self._open_existing_file(self._config)
  File "/home/TE_Density/transposon/overlap.py", line 320, in _open_existing_file
    raise ValueError(cfg.filepath)
ValueError: /multi_omics_analysis/TE_density/tmp/overlap/Co_genome_Chr1_RagTag_polished_overlap.h5
zmz1988 commented 1 year ago

In case you would like to see the installed packages:

(TE_density) $ conda list --name TE_density
# packages in environment at /miniconda3/envs/TE_density:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
attrs                     22.1.0                   pypi_0    pypi
black                     22.10.0                  pypi_0    pypi
bzip2                     1.0.8                h7f98852_4    conda-forge
ca-certificates           2022.12.7            ha878542_0    conda-forge
click                     8.1.3                    pypi_0    pypi
coloredlogs               15.0.1                   pypi_0    pypi
contourpy                 1.0.6                    pypi_0    pypi
cycler                    0.11.0                   pypi_0    pypi
exceptiongroup            1.0.4                    pypi_0    pypi
fonttools                 4.38.0                   pypi_0    pypi
h5py                      3.7.0                    pypi_0    pypi
humanfriendly             10.0                     pypi_0    pypi
iniconfig                 1.1.1                    pypi_0    pypi
kiwisolver                1.4.4                    pypi_0    pypi
ld_impl_linux-64          2.40                 h41732ed_0    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 12.2.0              h65d4601_19    conda-forge
libgomp                   12.2.0              h65d4601_19    conda-forge
libnsl                    2.0.0                h7f98852_0    conda-forge
libsqlite                 3.40.0               h753d276_0    conda-forge
libuuid                   2.32.1            h7f98852_1000    conda-forge
libzlib                   1.2.13               h166bdaf_4    conda-forge
matplotlib                3.6.2                    pypi_0    pypi
mypy-extensions           0.4.3                    pypi_0    pypi
ncurses                   6.3                  h27087fc_1    conda-forge
numexpr                   2.8.4                    pypi_0    pypi
numpy                     1.23.5                   pypi_0    pypi
openssl                   3.1.0                h0b41bf4_0    conda-forge
packaging                 21.3                     pypi_0    pypi
pandas                    1.5.2                    pypi_0    pypi
pathspec                  0.10.2                   pypi_0    pypi
pillow                    9.3.0                    pypi_0    pypi
pip                       23.0.1             pyhd8ed1ab_0    conda-forge
platformdirs              2.5.4                    pypi_0    pypi
pluggy                    1.0.0                    pypi_0    pypi
pyparsing                 3.0.9                    pypi_0    pypi
pytest                    7.2.0                    pypi_0    pypi
python                    3.10.9          he550d4f_0_cpython    conda-forge
python-dateutil           2.8.2                    pypi_0    pypi
pytz                      2022.6                   pypi_0    pypi
readline                  8.1.2                h0f457ee_0    conda-forge
scipy                     1.9.3                    pypi_0    pypi
setuptools                67.6.0             pyhd8ed1ab_0    conda-forge
six                       1.16.0                   pypi_0    pypi
tables                    3.7.0                    pypi_0    pypi
tk                        8.6.12               h27826a3_0    conda-forge
tomli                     2.0.1                    pypi_0    pypi
tqdm                      4.64.1                   pypi_0    pypi
typing-extensions         3.7.4.3                  pypi_0    pypi
tzdata                    2022g                h191b570_0    conda-forge
wcwidth                   0.1.7                    pypi_0    pypi
wheel                     0.40.0             pyhd8ed1ab_0    conda-forge
wrapt                     1.11.2                   pypi_0    pypi
xz                        5.2.6                h166bdaf_0    conda-forge
sjteresi commented 1 year ago

Hi, Thank you for the detailed report.

The first error, with check_nulls, arises from Python not knowing where to search for modules. I suggest you look at section 6.1.2 (The Module Search Path) of the Python tutorial, which covers modules and PYTHONPATH.

Second, can you please try running the system test? If you are on the most recent version of the master branch, you should be able to run make system_test. If that works we should be able to move forward easily from there. I suspect that it is either a package issue related to installing from conda, or an h5 file that wasn't properly written from a previously failed run (and it is trying to be re-opened). In the case of the latter, I would suggest deleting all H5 files and starting once more.
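For readers who want to script the "delete all H5 files and start again" step, here is a minimal sketch. It is not part of TE_Density; the function name `remove_stale_h5` and its `dry_run` flag are made up for illustration, and it assumes that every `.h5` file under the chosen output directory can safely be regenerated by a fresh run.

```python
from pathlib import Path


def remove_stale_h5(output_dir, dry_run=True):
    """List every .h5 file under output_dir and delete them unless dry_run.

    Hypothetical helper, not a TE_Density function. It assumes all .h5
    files below output_dir are caches or results that a rerun recreates.
    """
    stale = sorted(Path(output_dir).rglob("*.h5"))
    if not dry_run:
        for path in stale:
            path.unlink()
    return stale
```

Calling `remove_stale_h5("Co_TE_density")` only lists the files; pass `dry_run=False` once the list looks right. Deleting the whole output folder by hand achieves the same thing.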

Did the revision step work OK?

zmz1988 commented 1 year ago

Thanks for replying so quickly! Yes, you're right! The H5 files left over from the old failed run caused the problem. I deleted the whole output folder and ran it again, and the run was successful! Thanks a lot!

I also managed to get check_nulls working by adding sys.path.append('/absolute_path_to_github_downloads/TE_Density/') before the imports in the helper script. I'm sharing it here in case other Python beginners like me don't know how to solve it.
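The workaround described here might look like the sketch below when placed at the top of a helper script. The path is the placeholder from the comment, not a real location, and `add_repo_to_path` is a hypothetical name. Setting the PYTHONPATH environment variable to the repository root before running the script should have the same effect without editing the file.

```python
import sys
from pathlib import Path

# Placeholder path taken from the comment; replace with your own checkout.
REPO_ROOT = Path("/absolute_path_to_github_downloads/TE_Density")


def add_repo_to_path(repo_root):
    """Append repo_root to sys.path once so its packages can be imported."""
    root = str(Path(repo_root))
    if root not in sys.path:
        sys.path.append(root)
    return root


add_repo_to_path(REPO_ROOT)
# Imports from the repository (for example its `transposon` package)
# go below this line, after the search path has been extended.
```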

Thanks again! :)

zmz1988 commented 1 year ago

Sorry, it's me again. I hope it's OK to ask another question here.

I ran into a mismatch of pseudomolecule names between the H5 files and the GeneData file, but I don't see why they don't match. Could you give me some advice, please? Thanks a lot in advance!

(TE_density) $ python /TE_Density/examples/general_read_density_data.py Cleaned_Co_new_annotation_clean1_cleanTag.tsv Co_TE_density "Co_genome_(.*?).h5"
2023-03-17 11:39:22 baddiel.local __main__[12346] INFO import of preprocessed gene annotation... success!
2023-03-17 11:39:22 baddiel.local __main__[12346] INFO 
            Using the user's provided regex string 'Co_genome_(.*?).h5' to match file
            objects and identify the proper pseudomolecule group for
            each file. Regex group 1 of this string must correspond to
            a pseudomolecule. This is needed to initialize DensityData.
            The user should verify that the pseudomolecule IDs derived from
            the GeneData correspond to the groups derived from the filename
            of the output .h5 data.

2023-03-17 11:39:22 baddiel.local __main__[12346] INFO Pseudomolecule from GeneData is Chr1_RagTag_polished, Regex group 1 of <re.Match object; span=(77, 111), match='Co_genome_Chr1_RagTag_polished.h5'> is Chr1_RagTag_polished
2023-03-17 11:39:22 baddiel.local __main__[12346] INFO Pseudomolecule from GeneData is Chr2_RagTag_polished, Regex group 1 of <re.Match object; span=(77, 111), match='Co_genome_Chr2_RagTag_polished.h5'> is Chr2_RagTag_polished
2023-03-17 11:39:22 baddiel.local __main__[12346] INFO Pseudomolecule from GeneData is Chr3_RagTag_polished, Regex group 1 of <re.Match object; span=(77, 111), match='Co_genome_Chr3_RagTag_polished.h5'> is Chr3_RagTag_polished
2023-03-17 11:39:22 baddiel.local __main__[12346] INFO Pseudomolecule from GeneData is Chr4_RagTag_polished, Regex group 1 of <re.Match object; span=(77, 111), match='Co_genome_Chr4_RagTag_polished.h5'> is Chr4_RagTag_polished
2023-03-17 11:39:22 baddiel.local __main__[12346] INFO Pseudomolecule from GeneData is Chr5_RagTag_polished, Regex group 1 of <re.Match object; span=(77, 111), match='Co_genome_Chr5_RagTag_polished.h5'> is Chr5_RagTag_polished
2023-03-17 11:39:22 baddiel.local __main__[12346] INFO Pseudomolecule from GeneData is ChrC_RagTag_polished, Regex group 1 of <re.Match object; span=(77, 111), match='Co_genome_ChrC_RagTag_polished.h5'> is ChrC_RagTag_polished
2023-03-17 11:39:22 baddiel.local __main__[12346] INFO Pseudomolecule from GeneData is ChrM_RagTag_polished, Regex group 1 of <re.Match object; span=(77, 111), match='Co_genome_ChrM_RagTag_polished.h5'> is ChrM_RagTag_polished
2023-03-17 11:39:22 baddiel.local __main__[12346] CRITICAL The strings of chromosomes in your unprocessed
                hdf5 files: ['Chr1_RagTag_polished', 'Chr1_RagTag_polished_GeneData', 'Chr1_RagTag_polished_TEData', 'Chr1_RagTag_polished_overlap', 'Chr2_RagTag_polished', 'Chr2_RagTag_polished_GeneData', 'Chr2_RagTag_polished_TEData', 'Chr2_RagTag_polished_overlap', 'Chr3_RagTag_polished', 'Chr3_RagTag_polished_GeneData', 'Chr3_RagTag_polished_TEData', 'Chr3_RagTag_polished_overlap', 'Chr4_RagTag_polished', 'Chr4_RagTag_polished_GeneData', 'Chr4_RagTag_polished_TEData', 'Chr4_RagTag_polished_overlap', 'Chr5_RagTag_polished', 'Chr5_RagTag_polished_GeneData', 'Chr5_RagTag_polished_TEData', 'Chr5_RagTag_polished_overlap', 'ChrC_RagTag_polished', 'ChrC_RagTag_polished_GeneData', 'ChrC_RagTag_polished_TEData', 'ChrC_RagTag_polished_overlap', 'ChrM_RagTag_polished', 'ChrM_RagTag_polished_GeneData', 'ChrM_RagTag_polished_TEData', 'ChrM_RagTag_polished_overlap'], identified using your supplied
                regex pattern: 'Co_genome_(.*?).h5', do not match the
                chromosomes in the GeneData: ['Chr1_RagTag_polished', 'Chr2_RagTag_polished', 'Chr3_RagTag_polished', 'Chr4_RagTag_polished', 'Chr5_RagTag_polished', 'ChrC_RagTag_polished', 'ChrM_RagTag_polished'].
Traceback (most recent call last):
  File "/TE_Density/examples/general_read_density_data.py", line 80, in <module>
    processed_dd_data = DensityData.from_list_gene_data_and_hdf5_dir(
  File "/TE_Density/transposon/density_data.py", line 666, in from_list_gene_data_and_hdf5_dir
    raise ValueError
ValueError
sjteresi commented 1 year ago

Hi,

Since it is looking for any .h5 files and it seems to be recognizing the TE Data and Gene Data files, I would suggest you place only the density data output files (not the gene, overlap, or TE data files) in a directory of their own. And then try re-running. I think that should fix it.

Either way, please let me know whether that works. I'll make a note to improve the error message for a future release.
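To see why the extra files trip the check, the sketch below applies the user's regex to the four kinds of filenames that appear in the logs. It only illustrates the matching with Python's `re` module and is not the code TE_Density runs internally; `extract_ids` is a hypothetical helper.

```python
import re

# The pattern supplied on the command line in the report above.
pattern = re.compile(r"Co_genome_(.*?).h5")

# Filenames inferred from the log output for one pseudomolecule.
filenames = [
    "Co_genome_Chr1_RagTag_polished.h5",           # density output
    "Co_genome_Chr1_RagTag_polished_GeneData.h5",  # input cache
    "Co_genome_Chr1_RagTag_polished_TEData.h5",    # input cache
    "Co_genome_Chr1_RagTag_polished_overlap.h5",   # temporary overlap data
]
gene_data_ids = {"Chr1_RagTag_polished"}


def extract_ids(names, regex):
    """Return regex group 1 for every name the pattern matches."""
    return [m.group(1) for name in names if (m := regex.search(name))]


ids = extract_ids(filenames, pattern)
# IDs derived from filenames that have no counterpart in the GeneData.
unexpected = [i for i in ids if i not in gene_data_ids]
print(unexpected)
```

The three suffixed IDs end up in `unexpected`, which mirrors the CRITICAL message: the pattern matches every `.h5` file it is shown, so the directory itself has to contain only density output.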

zmz1988 commented 1 year ago

Yes, it's fixed now, as you suggested. I previously ran general_read_density_data.py directly on the output folder of process_genome.py, thinking the folder would contain nothing but the density data .h5 files (but of course there are other .h5 files in the filtered_input_data and tmp folders). I have now created a new directory containing links to only the density output .h5 files, and everything runs smoothly.

It's so embarrassing that I didn't think to check the tmp and filtered_input_data folders; otherwise I could have solved this myself. Thanks again for your kind help!
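A rough sketch of that fix, for anyone who wants to automate it: the helper below symlinks only the density output files into a separate directory. The function name, the `prefix` default, and the suffix list are assumptions based on the filenames in the logs above, not TE_Density behavior, so check them against your own output folder.

```python
from pathlib import Path

# Suffixes seen in the logs for non-density files (assumed to be complete).
SKIP_SUFFIXES = ("_GeneData", "_TEData", "_overlap")


def link_density_files(output_dir, link_dir, prefix="Co_genome_"):
    """Symlink density .h5 files from output_dir into link_dir.

    Hypothetical helper: files whose stem ends in one of SKIP_SUFFIXES
    are treated as input caches or overlap data and are left out.
    """
    out = Path(output_dir).resolve()
    dest = Path(link_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    linked = []
    for path in sorted(out.rglob(f"{prefix}*.h5")):
        if path.stem.endswith(SKIP_SUFFIXES):
            continue
        if dest in path.parents:
            continue  # skip links from a previous call if link_dir is nested
        target = dest / path.name
        if not (target.is_symlink() or target.exists()):
            target.symlink_to(path)
        linked.append(target)
    return linked
```

The resulting directory can then be passed to general_read_density_data.py in place of the full output folder.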

sjteresi commented 1 year ago

It is totally OK! I'm glad I could help. It is useful to see where people get stuck, so I can write better error messages.

Good luck with your project!

zmz1988 commented 1 year ago

Thanks a lot!