sjteresi / TE_Density

Python script calculating transposable element density for all genes in a genome. Publication: https://mobilednajournal.biomedcentral.com/articles/10.1186/s13100-022-00264-4
GNU General Public License v3.0
30 stars 4 forks source link

chi_squared_density_comparison.py not running #122

Closed amvarani closed 2 years ago

amvarani commented 2 years ago

Hi there I'm trying to run the chi_squared_density_comparison.py script, but got the error below.

2022-11-01 10:56:29 grandiflorum __main__[40093] INFO import of preprocessed gene annotation... success!
2022-11-01 10:56:29 grandiflorum __main__[40093] INFO 
            Using the user's provided regex string 'TgrandC1074_(.*?).h5' to match file
            objects and identify the proper pseudomolecule group for
            each file. Regex group 1 of this string must correspond to
            a pseudomolecule. This is needed to initialize DensityData.
            The user should verify that the pseudomolecule IDs derived from
            the GeneData correspond to the groups derived from the filename
            of the output .h5 data.

2022-11-01 10:56:29 grandiflorum __main__[40093] INFO Pseudomolecule from GeneData is chr1, Regex group 1 of <re.Match object; span=(70, 89), match='TgrandC1074_chr1.h5'> is chr1
2022-11-01 10:56:29 grandiflorum __main__[40093] INFO Pseudomolecule from GeneData is chr10, Regex group 1 of <re.Match object; span=(70, 90), match='TgrandC1074_chr10.h5'> is chr10
2022-11-01 10:56:29 grandiflorum __main__[40093] INFO Pseudomolecule from GeneData is chr2, Regex group 1 of <re.Match object; span=(70, 89), match='TgrandC1074_chr2.h5'> is chr2
2022-11-01 10:56:29 grandiflorum __main__[40093] INFO Pseudomolecule from GeneData is chr3, Regex group 1 of <re.Match object; span=(70, 89), match='TgrandC1074_chr3.h5'> is chr3
2022-11-01 10:56:29 grandiflorum __main__[40093] INFO Pseudomolecule from GeneData is chr4, Regex group 1 of <re.Match object; span=(70, 89), match='TgrandC1074_chr4.h5'> is chr4
2022-11-01 10:56:29 grandiflorum __main__[40093] INFO Pseudomolecule from GeneData is chr5, Regex group 1 of <re.Match object; span=(70, 89), match='TgrandC1074_chr5.h5'> is chr5
2022-11-01 10:56:29 grandiflorum __main__[40093] INFO Pseudomolecule from GeneData is chr6, Regex group 1 of <re.Match object; span=(70, 89), match='TgrandC1074_chr6.h5'> is chr6
2022-11-01 10:56:29 grandiflorum __main__[40093] INFO Pseudomolecule from GeneData is chr7, Regex group 1 of <re.Match object; span=(70, 89), match='TgrandC1074_chr7.h5'> is chr7
2022-11-01 10:56:29 grandiflorum __main__[40093] INFO Pseudomolecule from GeneData is chr8, Regex group 1 of <re.Match object; span=(70, 89), match='TgrandC1074_chr8.h5'> is chr8
2022-11-01 10:56:29 grandiflorum __main__[40093] INFO Pseudomolecule from GeneData is chr9, Regex group 1 of <re.Match object; span=(70, 89), match='TgrandC1074_chr9.h5'> is chr9
2022-11-01 10:56:29 grandiflorum __main__[40093] CRITICAL The strings of chromosomes in your unprocessed
                hdf5 files: ['chr1', 'chr10', 'chr10_GeneData', 'chr10_TEData', 'chr10_overlap', 'chr1_GeneData', 'chr1_TEData', 'chr1_overlap', 'chr2', 'chr2_GeneData', 'chr2_TEData', 'chr2_overlap', 'chr3', 'chr3_GeneData', 'chr3_TEData', 'chr3_overlap', 'chr4', 'chr4_GeneData', 'chr4_TEData', 'chr4_overlap', 'chr5', 'chr5_GeneData', 'chr5_TEData', 'chr5_overlap', 'chr6', 'chr6_GeneData', 'chr6_TEData', 'chr6_overlap', 'chr7', 'chr7_GeneData', 'chr7_TEData', 'chr7_overlap', 'chr8', 'chr8_GeneData', 'chr8_TEData', 'chr8_overlap', 'chr9', 'chr9_GeneData', 'chr9_TEData', 'chr9_overlap', 'nameless_revision_cache', 'order_revision_cache', 'superfam_revision_cache'], identified using your supplied
                regex pattern: 'TgrandC1074_(.*?).h5', do not match the
                chromosomes in the GeneData: ['chr1', 'chr10', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9'].
Traceback (most recent call last):
  File "../TE_Density/chi_squared_density_comparison.py", line 210, in <module>
    processed_dd_data = DensityData.from_list_gene_data_and_hdf5_dir(
  File "/home/amvarani/Cupuacu/FIX-Assemblies/FINAL/TE_Density/TE_Density/transposon/density_data.py", line 656, in from_list_gene_data_and_hdf5_dir
    raise ValueError
ValueError
sjteresi commented 2 years ago

Hello Alessandro,

I think the problem here is that the folder you are supplying as an argument to from_list_gene_data_and_hdf5_dir also contains the GeneData, TEData, and overlap h5 files. It needs to be a folder of just the standard output alone. I.e chr1.h5 etc.

I will modify the docstring and error message of the from_list_gene_data_and_hdf5_dir function to be more specific for future users but I won't be able to get to it until later next week or so.

amvarani commented 2 years ago

Thanks a lot! I realized that and fixed it! The script is now working!

I have another question: In your manuscript, you stated something like this regarding figure 2:

"However these observed density differences in upstream and downstream values are not significantly different based on a chi-square test (χ2 = 0.0056; p-value ≥0.9). "

When I run the chi_squared_density_comparison.py, the results provided are a tsv file contaning two columns (see below)

How can I calculate the p-value and x2 values ?

For instance:

chr4_Up chr4_Down 0.0 0.0 0.0 0.0 0.0 0.62637365 0.62637365 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.994006 0.13886113 0.2977023 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.22277722 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

sjteresi commented 2 years ago

Hello Alessandro,

I apologize, I wish I could be more help here. That chi_squared_density_comparison.py code is messy, as it was done right before the holiday season last year at the request of a reviewer and I never got around to instituting a better method. I envisioned that user's would implement their own statistical analysis of these values in R or Python, this is something you will need to do on your own here.

That being said, I would suggest you check out the scipy module for their stats packages if you plan to conduct your analysis in Python. They have good documentation on their functions.

sjteresi commented 2 years ago

Hello Alessandro,

May I close this issue?