sjteresi / TE_Density

Python script calculating transposable element density for all genes in a genome. Publication: https://mobilednajournal.biomedcentral.com/articles/10.1186/s13100-022-00264-4
GNU General Public License v3.0
30 stars 4 forks source link

compare_expression #128

Closed jordana-olive closed 5 months ago

jordana-olive commented 1 year ago

Hi Teresi! What an amazing tool for TE studying.

I'm having an issue to run compare_expression.py.

I've run the processe_genome successfully:

python3.10 ../../Programs/TE_Density/process_genome.py --num_threads 5 --config_file ../../Programs/TE_Density/config/production_run_config.ini gene.gtf repeat-masker.gtf Rhiir_chr_ -o ./

Then, I obtained the .h5, and can generate the dotplots properly. Rhiir_chr__Rhiir_chr_1.h5 Rhiir_chr__Rhiir_chr_15.h5 Rhiir_chr__Rhiir_chr_20.h5 Rhiir_chr__Rhiir_chr_26.h5 Rhiir_chr__Rhiir_chr_31.h5 Rhiir_chr__Rhiir_chr_7.h5 Rhiir_chr__Rhiir_chr_10.h5 Rhiir_chr__Rhiir_chr_16.h5 Rhiir_chr__Rhiir_chr_21.h5 Rhiir_chr__Rhiir_chr_27.h5 Rhiir_chr__Rhiir_chr_32.h5 Rhiir_chr__Rhiir_chr_8.h5 Rhiir_chr__Rhiir_chr_11.h5 Rhiir_chr__Rhiir_chr_17.h5 Rhiir_chr__Rhiir_chr_22.h5 Rhiir_chr__Rhiir_chr_28.h5 Rhiir_chr__Rhiir_chr_33.h5 Rhiir_chr__Rhiir_chr_9.h5 Rhiir_chr__Rhiir_chr_12.h5 Rhiir_chr__Rhiir_chr_18.h5 Rhiir_chr__Rhiir_chr_23.h5 Rhiir_chr__Rhiir_chr_29.h5 Rhiir_chr__Rhiir_chr_4.h5 Rhiir_chr__Rhiir_chr_13.h5 Rhiir_chr__Rhiir_chr_19.h5 Rhiir_chr__Rhiir_chr_24.h5 Rhiir_chr__Rhiir_chr_3.h5 Rhiir_chr__Rhiir_chr_5.h5 Rhiir_chr__Rhiir_chr_14.h5 Rhiir_chr__Rhiir_chr_2.h5 Rhiir_chr__Rhiir_chr_25.h5 Rhiir_chr__Rhiir_chr_30.h5 Rhiir_chr__Rhiir_chr_6.h5 I've modified the compare_expression.py to have my own regex pattern, but I'm having this issue: `python3.10 examples/Blueberry_Expression/src/compare_expression2.py --output_dir /output TPM2.tsv annotation.gtf /$path-to-h5-results/results/

2023-03-08 18:36:26 s1703-r-19jg01 main[5001] INFO Using the user's provided regex string 'Rhiir_chr__(.*?).h5' to match file objects and identify the proper pseudomolecule group for each file. Regex group 1 of this string must correspond to a pseudomolecule. This is needed to initialize DensityData. The user should verify that the pseudomolecule IDs derived from the GeneData correspond to the groups derived from the filename of the output .h5 data.

2023-03-08 18:36:26 s1703-r-19jg01 main[5001] CRITICAL The strings of chromosomes in your unprocessed hdf5 files: ['Rhiir_chr_1', 'Rhiir_chr_10', 'Rhiir_chr_11', 'Rhiir_chr_12', 'Rhiir_chr_13', 'Rhiir_chr_14', 'Rhiir_chr_15', 'Rhiir_chr_16', 'Rhiir_chr_17', 'Rhiir_chr_18', 'Rhiir_chr_19', 'Rhiir_chr_2', 'Rhiir_chr_20', 'Rhiir_chr_21', 'Rhiir_chr_22', 'Rhiir_chr_23', 'Rhiir_chr_24', 'Rhiir_chr_25', 'Rhiir_chr_26', 'Rhiir_chr_27', 'Rhiir_chr_28', 'Rhiir_chr_29', 'Rhiir_chr_3', 'Rhiir_chr_30', 'Rhiir_chr_31', 'Rhiir_chr_32', 'Rhiir_chr_33', 'Rhiir_chr_4', 'Rhiir_chr_5', 'Rhiir_chr_6', 'Rhiir_chr_7', 'Rhiir_chr_8', 'Rhiir_chr_9'], identified using your supplied regex pattern: 'Rhiir_chr__(.*?).h5', do not match the chromosomes in the GeneData: []. Traceback (most recent call last): File "/home/jordana/Programs/TE_Density/examples/Blueberry_Expression/src/compare_expression2.py", line 435, in processed_dd_data = DensityData.from_list_gene_data_and_hdf5_dir( File "/home/jordana/Programs/TE_Density/transposon/density_data.py", line 666, in from_list_gene_data_and_hdf5_dir raise ValueError ValueError `

Where can I double-check the IDs derived from GeneData?

sjteresi commented 1 year ago

Hello,

I would suggest doing print(GeneDataobj.chromosomes). Looking at the error message, it appears that your GeneData object doesn't have that information. How did you initialize GeneData in your analysis script?

jordana-olive commented 1 year ago

I changed the line

chromosomes_i_want = ["Rhiir_chr_" + str(i) for i in range(33)] In import_genes...to select my chromosomes instead of the Blueberry genome, and it worked.

Now my problem is to generate the violin plots. I'm having only the histogram, apparently, without an error.

subsetted_tpm_matrix_by_genes[window_direction_type] = density_slice.slice /home/jordana/Programs/TE_Density/examples/Blueberry_Expression/src/compare_expression-MULE.py:121: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of callingframe.insertmany times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, usenewframe = frame.copy() subsetted_tpm_matrix_by_genes[window_direction_type] = density_slice.slice 2023-03-10 09:44:32 s1703-r-19jg01 matplotlib.legend[3737] WARNING No artists with labels found to put in legend. Note that artists whose label start with an underscore are ignored when legend() is called with no argument. 2023-03-10 09:44:32 s1703-r-19jg01 matplotlib.legend[3737] WARNING No artists with labels found to put in legend. Note that artists whose label start with an underscore are ignored when legend() is called with no argument. 2023-03-10 09:44:32 s1703-r-19jg01 matplotlib.legend[3737] WARNING No artists with labels found to put in legend. Note that artists whose label start with an underscore are ignored when legend() is called with no argument. /home/jordana/Programs/TE_Density/examples/Blueberry_Expression/src/compare_expression-MULE.py:380: UserWarning: Attempt to set non-positive ylim on a log-scaled axis will be ignored. plt.clf()

sjteresi commented 1 year ago

Hello,

That appears to be a matplotlib error. I'm not sure how much I can help here, but I will try to assist with some tips. I appreciate you bearing with me as you modify the compare_expression script. That script was meant to be more of a guide than a "insert your data" script. The warning must be a new thing since updating to Python 3.10 and the new packages. You can ignore that for now, as it shouldn't affect the validity of results.. An update to the user interface is on the way, but it is going to be awhile.

That being said, some advice: If you can get your dataframe of genes and their TE density values and gene expression values you should be able to graph everything if you just use the plot_expression_v_density_violion and plot_new_hist functions as guides. In particular, I found artist/label problems in matplotlib to be particularly confusing, and they were hard to diagnose. I would suggest reading this tutorial. Additionally, the histogram code produces a 3 panel figure (and in my experience as soon as the axes/subplots are involved, the likelihood for an artist issue increases), so I would suggest breaking that into a minimum working example. The artist/label issue is probably arising from an issue with the bins (i.e [0.0 0.1) etc) or the code's ability to subset your main data with the hard-coded TE type in the example.

If all else fails, I would suggest that you checkout the branch f/refactor_density_data_analysis. This branch is unfinished, but contains my work refactoring the analysis scripts, in particular the blueberry code saw the most changes. I think this should streamline your analysis, as it makes it easier to construct a pandas dataframe suitable for general plotting. This branch does not change the values of the data, it merely changes the helper functions for accessing the data.

Thank you, Scott

jordana-olive commented 1 year ago

Thank you very much for your attention, Scott.

I will test your suggestions. And also test with other python versions.

Kind regards,