sjteresi / TE_Density

Python script calculating transposable element density for all genes in a genome. Publication: https://mobilednajournal.biomedcentral.com/articles/10.1186/s13100-022-00264-4
GNU General Public License v3.0
30 stars 4 forks source link

compare_expression.py #123

Closed amvarani closed 6 months ago

amvarani commented 2 years ago

How to change the compare_expression.py to plot LTR, Copia, Gypsy and others ? The current script only plot TIRs

sjteresi commented 2 years ago

Hello Alessandro,

Line 199 of examples/Blueberry_Expression/src/compare_expression.py contains the subsetting command where I was plotting TIRs. You can change that if you want. However I have made some improvements since then on how I access the data, more on that further down...

More importantly the compare expression script was somewhat heavily customized for the purposes of the application examples in the publication. It may be a better idea to begin with the general_read_density_data.py script and build from there, borrowing as you need from compare_expression.py. I did not intend for the compare expression script to serve as a master script on analyzing expression, merely just a rough guide.

Now onto how I would approach a research question of relating TE Density to gene expression. Here is the general skeleton of how I would do that:

  1. Initialize DensityData objects (this is per pseudomolecule) for your genome
  2. Read in your expression matrix.
  3. Create a dataframe of some sort where you have a gene and its TE density value for a specific window/TE combination
  4. Graph the new table

The tough part is number 3.

I am currently working on refactoring the helper code to make analysis of the output data more user friendly and less painful. Using v.1.0.1 as a reference, I would suggest you investigate the add_te_vals_to_gene_info_pandas and add_hdf5_indices_to_gene_data methods in DensityData. These will help you with step 3. I will hopefully have more to share here soon (my plan is for the next two weeks).

sjteresi commented 2 years ago

Hello Alessandro,

I am still working on refactoring the helper code to make things easier for users, however I have something available right now that I think would be much more helpful to you than the existing code. Go ahead and check out the linked branch f/refactor_density_data_analysis. In there I have modified the blueberry compare_expression.py script to be more user-friendly (and am working on the other example codes too). You should be able to mimic this general paradigm for other analyses so it should be of use to you even if you aren't doing TE Density and gene expression analyses.

There are ~2 caveats to this branch. Since the code is based off of > v.2.0.0 you will need to re-generate your TE Density data before you proceed. Old output .h5 data won't work with the new DensityData reader due to us updating the h5py package. Additionally, this branch may change a little bit when I rebase to master in the next two weeks or so. So you'll want to update your version number and force pull from master at that point. I'll be sure to alert you.

I appreciate your patience in this matter, I don't want the data to be a pain in the butt to analyze; and the blueberry analysis has been at the top of my list for modifying how it is performed. I really wanted to deprecate the yield_all_slices method which was only used in the blueberry data...

amvarani commented 2 years ago

Hi there Thanks! I will test it! However, I got the compare_expression.py plotting LTR, Copia, Gypsy and others, just changing one line.

For example, on Line 107:

    te_columns_to_graph = ["Copia"]  # NB container to store string names, doesn't
    # matter that it is overwritten between loops, the names will be the
    # same for the last loop

In general, I got all scripts working and plotted the results. I'm using your paper as a reference to interpret the results, and I'm liking it very much!