Issue on the merge of all chromosome in the post processing part.

Manuel-Derrien commented 2 years ago

Hello,

I'm trying to merge the .h5 files to output plots for the whole genome. I've seen your explanation of how to output a single output file and tried to read the density data, but I don't really understand the regex format we need to input:

python examples/general_read_density_data.py CLEANED_GENE_ANNOTATION.tsv DENSITY_DATA_FOLDER "Arabidopsis_(.*?).h5"

Could you provide a more explicit way to input all the .h5 files at once, or to merge them? I especially don't understand the "Arabidopsis_(.*?).h5" part.

I hope what I said is clear enough; feel free to ask for more information.

All the best.

sjteresi commented 2 years ago

Hello Manuel,

I have seen your issue and will respond more in-depth on Thursday (7/14) as I am currently on vacation with family. Thank you for using my tool and I hope I can be of help. Sorry for the delay in response!

Manuel-Derrien commented 2 years ago

Hello Scott,

Don't worry about the delay, and enjoy your vacation!

sjteresi commented 2 years ago

Hello Manuel,

For this issue please reference the scripts transposon/density_data.py and examples/general_read_density_data.py. The primary way you can import/read individual or multiple .h5 files is through the classmethod from_list_gene_data_and_hdf5_dir() in transposon/density_data.py. This is the function being used at the very end of general_read_density_data.py. If you check out that classmethod in the code there should be more documentation; I will try to summarize it below.

Regex Format: The regex needs to correspond to the names of the .h5 files that are the output of TE Density, and it must deliver the pseudomolecule ID as a group. There is more documentation in the classmethod docstring, but basically the regex string you supply to the code should be able to recognize the unique chromosome/pseudomolecule ID of each .h5 file. For example, on my computer, the Arabidopsis output files are "Arabidopsis_Chr1.h5", "Arabidopsis_Chr2.h5", "Arabidopsis_Chr3.h5", etc. Thus, when you supply "Arabidopsis_(.*?).h5" as a regex string, it grabs the substring "Chr1", "Chr2", etc. as part of the Python regex match object. There is more documentation on Python regex here.
Supplying multiple .h5 files: As previously mentioned, the classmethod from_list_gene_data_and_hdf5_dir() should be the way you read one or multiple .h5 files. Currently I have no plans to implement a merge of the output data for each pseudomolecule. That is to say, when you read the data (initialize the DensityData class) you receive an instance of DensityData for each pseudomolecule in your dataset. The processed_dd_data object at the end of general_read_density_data.py is a Python list of DensityData instances; each instance corresponds to one pseudomolecule of TE Density data.
I will add more documentation to general_read_density_data.py and from_list_gene_data_and_hdf5_dir() over the coming days to include some of the explanations I have written above.
Finally, if you tell me the structure of your .h5 data (the general format of the filenames) I should be able to assist you in creating the correct regex string to supply as an input argument, i.e. the place where you provide your own custom version of "Arabidopsis_(.*?).h5".
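
To make the capture-group idea concrete, here is a minimal standalone sketch using only Python's built-in re module. It is not the TE Density code: the helper name extract_ids and the use of re.search are assumptions made for illustration, and the filenames are the hypothetical Arabidopsis examples from above.

```python
import re

# Hypothetical filenames, mirroring the Arabidopsis examples in this thread.
filenames = ["Arabidopsis_Chr1.h5", "Arabidopsis_Chr2.h5", "Arabidopsis_Chr3.h5"]

# The parentheses form capture group 1; ".*?" lazily matches as few
# characters as possible before the trailing ".h5".
pattern = r"Arabidopsis_(.*?).h5"


def extract_ids(pattern, filenames):
    """Return the text captured by group 1 for each filename (None if no match).

    Illustrative helper only; the real classmethod may match differently.
    """
    ids = []
    for name in filenames:
        match = re.search(pattern, name)
        ids.append(match.group(1) if match is not None else None)
    return ids


chromosome_ids = extract_ids(pattern, filenames)
print(chromosome_ids)  # ['Chr1', 'Chr2', 'Chr3']
```

Note that the unescaped "." in ".h5" matches any character in a regex; writing "\.h5" would make it match a literal dot only.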

Manuel-Derrien commented 2 years ago

Hello Scott,

First, thank you for this complete answer. The format of my files is like "paternal_scaffold_1", "paternal_scaffold_2", ... When I run

python3 /mnt/c/Users/manu/Desktop/A._analyse/TE_density/TE_Density-master/general_read_density_data.py /mnt/c/Users/manu/Desktop/A._analyse/workspace_n1/Cleaned_paternal_imprinted.tsv /mnt/c/Users/manu/Desktop/A._analyse/results_paternal "paternal_scaffold_(.*?).h5"

I get this error message:

2022-07-15 10:07:08 laptop __main__[158] INFO Successfully imported the preprocessed gene annotation information: /mnt/c/Users/manu/Desktop/A._analyse/workspace_n1/Cleaned_paternal_imprinted.tsv
Traceback (most recent call last):
  File "/mnt/c/Users/manu/Desktop/A._analyse/TE_density/TE_Density-master/general_read_density_data.py", line 70, in <module>
    processed_dd_data = DensityData.from_list_gene_data_and_hdf5_dir(
  File "/mnt/c/Users/manu/Desktop/A._analyse/TE_density/TE_Density-master/transposon/density_data.py", line 590, in from_list_gene_data_and_hdf5_dir
    [x.group(1) for x in chromosome_ids_unprocessed_h5_files]
  File "/mnt/c/Users/manu/Desktop/A._analyse/TE_density/TE_Density-master/transposon/density_data.py", line 590, in <listcomp>
    [x.group(1) for x in chromosome_ids_unprocessed_h5_files]
AttributeError: 'NoneType' object has no attribute 'group'

I think the issue is the way I input my files, but I'm not sure. All the best,

sjteresi commented 2 years ago

Hi Manuel,

Please try doing this on a new branch I just uploaded: d/additional_user_help_density_data

I added some log info statements to provide more information during run time as to how the pseudomolecule ID is matching with the output .h5 file. That should help users see better if their regex is not producing the needed string to initialize DensityData.

And I think the regex pattern you need to supply is "paternal_(.*?).h5"
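
For anyone hitting the same AttributeError: Python's re functions return None when a pattern does not match, so calling .group(1) on that result fails exactly as in the traceback above. The sketch below is a standalone way to check a pattern against a list of filenames before running the script. It is not part of TE Density; the helper capture_ids, the sample filenames (including the stray notes.txt), and the escaped "\.h5" are all assumptions for illustration, and the actual contents of the results directory in this thread are unknown.

```python
import re


def capture_ids(pattern, filenames):
    """Map each filename to the text captured by group 1, or None if no match.

    Hypothetical diagnostic helper, not a TE Density function.
    """
    results = {}
    for name in filenames:
        match = re.search(pattern, name)
        # re.search returns None on failure; guard before calling .group().
        results[name] = match.group(1) if match is not None else None
    return results


# Hypothetical directory listing; real filenames may differ.
filenames = ["paternal_scaffold_1.h5", "paternal_scaffold_2.h5", "notes.txt"]

# The two patterns discussed in this thread, with the dot escaped.
narrow = capture_ids(r"paternal_scaffold_(.*?)\.h5", filenames)
broad = capture_ids(r"paternal_(.*?)\.h5", filenames)

# Any filename listed here would trigger the 'NoneType' error if .group(1)
# were called on its match without a guard.
unmatched = [name for name, captured in broad.items() if captured is None]

print(narrow)
print(broad)
print(unmatched)
```

With these sample names, the narrower pattern captures "1" and "2", while the broader one captures "scaffold_1" and "scaffold_2". Which of these matches the pseudomolecule IDs in the gene annotation is something to check against your own data.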

Manuel-Derrien commented 2 years ago

Alright, Thanks a lot !

sjteresi / TE_Density

Issue on the merge of all chromosome in the post processing part. #109