sjteresi / TE_Density

Python script calculating transposable element density for all genes in a genome. Publication: https://mobilednajournal.biomedcentral.com/articles/10.1186/s13100-022-00264-4
GNU General Public License v3.0

regex pattern does not match #134

Open SolayMane opened 1 year ago

SolayMane commented 1 year ago

I ran the following command:

 python general_read_density_data.py TE_input/Cleaned_Fusarium_oxysporum_f.sp._albedinis_Foa_44.tsv out_TE_density/ "FOA_(.*?).h5"

I got this error:

The strings of chromosomes in your unprocessed
                hdf5 files: ['Chr_1', 'Chr_10', 'Chr_10_GeneData', 'Chr_10_TEData', 'Chr_10_overlap', 'Chr_11', 'Chr_11_GeneData', 'Chr_11_TEData', 'Chr_11_overlap', 'Chr_12', 'Chr_12_GeneData', 'Chr_12_TEData', 'Chr_12_overlap', 'Chr_13', 'Chr_13_GeneData', 'Chr_13_TEData', 'Chr_13_overlap', 'Chr_1_GeneData', 'Chr_1_TEData', 'Chr_1_overlap', 'Chr_2', 'Chr_2_GeneData', 'Chr_2_TEData', 'Chr_2_overlap', 'Chr_4', 'Chr_4_GeneData', 'Chr_4_TEData', 'Chr_4_overlap', 'Chr_5', 'Chr_5_GeneData', 'Chr_5_TEData', 'Chr_5_overlap', 'Chr_7', 'Chr_7_GeneData', 'Chr_7_TEData', 'Chr_7_overlap', 'Chr_8', 'Chr_8_GeneData', 'Chr_8_TEData', 'Chr_8_overlap', 'Chr_9', 'Chr_9_GeneData', 'Chr_9_TEData', 'Chr_9_overlap', 'Contig1', 'Contig11', 'Contig11_GeneData', 'Contig11_TEData', 'Contig11_overlap', 'Contig12', 'Contig12_GeneData', 'Contig12_TEData', 'Contig12_overlap', 'Contig15', 'Contig15_GeneData', 'Contig15_TEData', 'Contig15_overlap', 'Contig17', 'Contig17_GeneData', 'Contig17_TEData', 'Contig17_overlap', 'Contig18', 'Contig18_GeneData', 'Contig18_TEData', 'Contig18_overlap', 'Contig19', 'Contig19_GeneData', 'Contig19_TEData', 'Contig19_overlap', 'Contig1_GeneData', 'Contig1_TEData', 'Contig1_overlap', 'Contig21', 'Contig21_GeneData', 'Contig21_TEData', 'Contig21_overlap', 'Contig22', 'Contig22_GeneData', 'Contig22_TEData', 'Contig22_overlap', 'Contig23', 'Contig23_GeneData', 'Contig23_TEData', 'Contig23_overlap', 'Contig24', 'Contig24_GeneData', 'Contig24_TEData', 'Contig24_overlap', 'Contig25', 'Contig25_GeneData', 'Contig25_TEData', 'Contig25_overlap', 'Contig26', 'Contig26_GeneData', 'Contig26_TEData', 'Contig26_overlap', 'Contig29', 'Contig29_GeneData', 'Contig29_TEData', 'Contig29_overlap', 'Contig3', 'Contig31', 'Contig31_GeneData', 'Contig31_TEData', 'Contig31_overlap', 'Contig32', 'Contig32_GeneData', 'Contig32_TEData', 'Contig32_overlap', 'Contig33', 'Contig33_GeneData', 'Contig33_TEData', 'Contig33_overlap', 'Contig34', 'Contig34_GeneData', 'Contig34_TEData', 
'Contig34_overlap', 'Contig36', 'Contig36_GeneData', 'Contig36_TEData', 'Contig36_overlap', 'Contig38', 'Contig38_GeneData', 'Contig38_TEData', 'Contig38_overlap', 'Contig3_GeneData', 'Contig3_TEData', 'Contig3_overlap', 'Contig4', 'Contig42', 'Contig42_GeneData', 'Contig42_TEData', 'Contig42_overlap', 'Contig46', 'Contig46_GeneData', 'Contig46_TEData', 'Contig46_overlap', 'Contig49', 'Contig49_GeneData', 'Contig49_TEData', 'Contig49_overlap', 'Contig4_GeneData', 'Contig4_TEData', 'Contig4_overlap', 'Contig50', 'Contig50_GeneData', 'Contig50_TEData', 'Contig50_overlap', 'Contig51', 'Contig51_GeneData', 'Contig51_TEData', 'Contig51_overlap', 'Contig52', 'Contig52_GeneData', 'Contig52_TEData', 'Contig52_overlap', 'Contig54', 'Contig54_GeneData', 'Contig54_TEData', 'Contig54_overlap', 'Contig56', 'Contig56_GeneData', 'Contig56_TEData', 'Contig56_overlap', 'Contig59', 'Contig59_GeneData', 'Contig59_TEData', 'Contig59_overlap', 'Contig62', 'Contig62_GeneData', 'Contig62_TEData', 'Contig62_overlap', 'Contig65', 'Contig65_GeneData', 'Contig65_TEData', 'Contig65_overlap', 'Contig67', 'Contig67_GeneData', 'Contig67_TEData', 'Contig67_overlap', 'Contig68', 'Contig68_GeneData', 'Contig68_TEData', 'Contig68_overlap', 'Contig69', 'Contig69_GeneData', 'Contig69_TEData', 'Contig69_overlap', 'Contig7', 'Contig7_GeneData', 'Contig7_TEData', 'Contig7_overlap', 'Contig8', 'Contig8_GeneData', 'Contig8_TEData', 'Contig8_overlap', 'Contig9', 'Contig9_GeneData', 'Contig9_TEData', 'Contig9_overlap'], identified using your supplied
                regex pattern: 'FOA_(.*?).h5', do not match the
                chromosomes in the GeneData: ['Chr_1', 'Chr_10', 'Chr_11', 'Chr_12', 'Chr_13', 'Chr_2', 'Chr_4', 'Chr_5', 'Chr_7', 'Chr_8', 'Chr_9', 'Contig1', 'Contig11', 'Contig12', 'Contig15', 'Contig17', 'Contig18', 'Contig19', 'Contig21', 'Contig22', 'Contig23', 'Contig24', 'Contig25', 'Contig26', 'Contig29', 'Contig3', 'Contig31', 'Contig32', 'Contig33', 'Contig34', 'Contig36', 'Contig38', 'Contig4', 'Contig42', 'Contig46', 'Contig49', 'Contig50', 'Contig51', 'Contig52', 'Contig54', 'Contig56', 'Contig59', 'Contig62', 'Contig65', 'Contig67', 'Contig68', 'Contig69', 'Contig7', 'Contig8', 'Contig9'].
sjteresi commented 1 year ago

Hi,

  1. The folder path that you are pointing to should contain only the TE Density output, not the GeneData or TEData files. I have added commit a2f8b65 to master to provide a little more guidance here. I would suggest moving some of your files around and retrying.
  2. The issue with the regex here does indeed seem to be an error. I wrote the post-processor and the general_read_density_data.py code to use only one regex rule. This fails here because you have chromosomes/pseudomolecules that start with Chr and some that start with Contig. Thank you for bringing this to my attention! I have updated the master branch to commit d0eadfa (PR #135) and heavily modified the general_read_density_data.py script, along with adding a new (hopefully better) method to instantiate DensityData. Please check out DensityData.from_list_genedata_dir_and_hdf5_dir(). I hope this is helpful.
  3. Chances are you are using an old version of TE Density, since a lot has changed in the past 2 months. Please update to the latest commit on the master branch. Please make sure you are using Python 3.10 and that your packages are up to date. You will also need to regenerate your TE Density results, because I introduced changes to the intermediate files in PR #132, which I merged prior to this latest patch.
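
To illustrate why the chromosome lists in the error message differ, here is a minimal sketch using only Python's standard `re` module. The file names are assumptions inferred from the strings in the error output, and `extract_ids` is a hypothetical helper written for this example, not part of TE Density. It shows how a loose pattern such as `FOA_(.*?).h5` also captures the GeneData, TEData, and overlap files, while a stricter, purely illustrative pattern captures only the chromosome IDs.

```python
import re

# Hypothetical file names, inferred from the chromosome strings in the error
# message above; the real directory contents may differ.
filenames = [
    "FOA_Chr_1.h5",
    "FOA_Chr_1_GeneData.h5",
    "FOA_Chr_1_TEData.h5",
    "FOA_Chr_1_overlap.h5",
    "FOA_Contig11.h5",
    "FOA_Contig11_GeneData.h5",
]
# Chromosome IDs as they would appear in the gene annotation.
gene_data_chromosomes = ["Chr_1", "Contig11"]


def extract_ids(names, pattern):
    """Return the sorted first capture group of every name the pattern matches."""
    regex = re.compile(pattern)
    matches = (regex.search(name) for name in names)
    return sorted(m.group(1) for m in matches if m)


# The pattern from the original command also captures the intermediate files.
loose_ids = extract_ids(filenames, r"FOA_(.*?).h5")

# A stricter, illustrative pattern: escape the dot and allow only the two
# known prefixes followed by digits, so "_GeneData" and similar are rejected.
strict_ids = extract_ids(filenames, r"^FOA_((?:Chr_|Contig)\d+)\.h5$")

print(loose_ids)
print(strict_ids)
print(strict_ids == sorted(gene_data_chromosomes))
```

Keeping only the TE Density output in the folder, as suggested in point 1, remains the simpler fix; the stricter pattern is just one way to see what the loose one was matching.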
SolayMane commented 1 year ago

Thank you for your collaboration. It works fine. But it is not clear how to investigate the results. Could you add a straightforward tutorial on that?