novoalab / EpiNano

Detection of RNA modifications from Oxford Nanopore direct RNA sequencing reads (Liu*, Begik* et al., Nature Comm 2019)
GNU General Public License v2.0
109 stars 31 forks source link

Epinano with controls? #130

Closed Whatsacb closed 1 year ago

Whatsacb commented 1 year ago

Hello! I am currently curious what is recommended to use as a control for detecting m6A mRNA modifications? I have been using EpiNano 1.2 to look for m6A modifications on direct mRNA sequenced reads from a Nanopore MinION, which have been basecalled with Guppy 3.1.5. We have also added to the sample fully methylated and fully unmethylated m6A site mRNAs (g. luciferase and c. luciferase) using the kit here (https://www.epigentek.com/catalog/epiquik-m6a-rna-methylation-quantification-kit-colorimetric-p-3646.html). When I run epinano_variants.py on .bam files that have been isolated to contain just the gluc / cluc sequences, I get a response similar to the error reported here (https://www.epigentek.com/catalog/epiquik-m6a-rna-methylation-quantification-kit-colorimetric-p-3646.html). If I try adding in gluc / cluc to my reference genome assembly during the basecall as individual "chromosomes" and run that data through epinano, I get results from most of the data, but gluc and cluc are not found within the dataset. I am using the sam2tsv file provided by epinano.

Any ideas? Is there another, better control that is recommended? Is it possible that epinano is throwing out the gluc / cluc data when I run epinano_variants.py? Thanks for your help!

enovoa commented 1 year ago

Hello @Whatsacb, the link that you say as "error" is not an error/issue, but a link to a kit. Regarding controls, what we'd typically recommend/use if you're interested in m6A would be to use METTL3 KO (if in vivo samples) or unmodified synthetic molecules.

enovoa commented 1 year ago

About the references, Epinano should not be throwing out the gluc/cluc data, you can see in the examples that we have used it in synthetic sequences as well (e.g. curlcakes, etc) or customized references (eg. 25s rRNA, as in the demo/test set). Please try out the examples of the repo first and see if those work on your hands. Thanks!

Whatsacb commented 1 year ago

Hi enovoa,

Thanks for your reply, I really appreciate the help. I will try out the examples of the repo and report back.

Regarding controls, what we'd typically recommend/use if you're interested in m6A would be to use METTL3 KO (if in vivo samples) or unmodified synthetic molecules.

We can't use a knockout because these are patient derived human samples that have been cultured, but perhaps in the future we can treat the samples with FTO, or something else, that might serve as a better control.

About the references, Epinano should not be throwing out the gluc/cluc data, you can see in the examples that we have used it in synthetic sequences as well (e.g. curlcakes, etc) or customized references (eg. 25s rRNA, as in the demo/test set).

Perhaps this is just ignorance on my part, but in those customized reference cases, how do you go about mapping them prior to running EpiNano? What I have been doing thus far is adding the cDNA sequences for gluc/cluc to the beginning of the GRCh38 reference fasta file from the NIH as "chromosomes" like so using vim in terminal:

>gluc
GGAGACCCAAG...
>cluc
GGAGACCCAAG...
(rest of the original GRCh38 .fa reference file text here)

Then, I map my mRNA nanopore reads to the genome using Guppy 3.1.5/minimap2 via master of pores 2's preprocess script. The aligned .bam reads obtained from this I then sort out by "chromosome" using samtools for just "gluc" or "cluc", and running Epinano on that. It seems to work for chromosomes (like "chr19"), but not when sorting for "gluc" or "cluc". The result of running it on gluc/cluc sorted bam files is the following output:

/home/tj/MOP2/output/mp_output/FAM95931_wGCluc_sub/alignment/gluc_TMP_
already exists, will overwrite it
Process Process-2:
Traceback (most recent call last):
   File
"/home/tj/miniconda3/envs/epinano/lib/python3.7/multiprocessing/process.py",
line 297, in _bootstrap
     self.run()
   File
"/home/tj/miniconda3/envs/epinano/lib/python3.7/multiprocessing/process.py",
line 99, in run
     self._target(*self._args, **self._kwargs)
   File "EpiNano/Epinano_Variants.py", line 45, in
split_tsv_for_per_site_var_freq
     firstline = next (tsv)
StopIteration

Then it just hangs. I realize this might be a bit outside of just EpiNano issues, but any help or direction you can provide me will be greatly appreciated. Thank you so much!

Best, TJ

enovoa commented 1 year ago

Hi @Whatsacb, EpiNano_Variants.py function will compute some metrics, but it won't give you the differential modified sites. Basically, EpiNano can be ran in two modes, as described in the README: EpiNano-SVM (which works for single samples, but then was false positives), and EpiNano-Error (which requires paired samples). I'm not sure which of the two modes you are trying to run, but please note that you will not be able to predict differential sites in EpiNano-Error unless you have two files to compare.

To troubleshoot, you can map your reads ONLY onto your gluc/cluc references. Perhaps you are not adding correctly the references manually, no idea. You should also visualize your mapped reads to see if you are getting any mapped reads, in case this is the issue.

As mentioned, please try the repo examples, the rRNA mapping example is a customized example like the gluc/cluc case.

Whatsacb commented 1 year ago

Hi @enovoa, thanks for your help. I am currently trying to run EpiNano-SVM, since I want to run it across single samples (hence, why I'm basecalling my raw data with guppy 3.1.5). I am able to run it and get an output with my data aligned to the human genome, just not for my RNA reads.

I tried running EpiNano as you suggested using your test data (the wt.bam file and the reference "ref.fa"), using the following command:

python Epinano_Variants.py -n 6 \ -R test_data/make_predictions/ref.fa \ -b test_data/make_predictions/wt.bam \ -s misc/sam2tsv.jar --type g

Which generates the following:

test_data/make_predictions/wt_TMP_ already exists, will overwrite it
test_data/make_predictions/wt_TMP_ already exists, will overwrite it
Process Process-16:
Traceback (most recent call last):
  File "/home/tj/miniconda3/envs/epinano/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/tj/miniconda3/envs/epinano/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "Epinano_Variants.py", line 45, in split_tsv_for_per_site_var_freq
    firstline = next (tsv)
StopIteration

Then nothing. My current conda package install list is as follows:

# packages in environment at /home/tj/miniconda3/envs/epinano:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
biopython                 1.79             py37h540881e_2    conda-forge
blas                      1.1                    openblas    conda-forge
bokeh                     2.4.3              pyhd8ed1ab_3    conda-forge
bwa                       0.7.17               h7132678_9    bioconda
bzip2                     1.0.8                h7f98852_4    conda-forge
c-ares                    1.18.1               h7f98852_0    conda-forge
ca-certificates           2022.9.24            ha878542_0    conda-forge
click                     8.1.3            py37h89c1867_0    conda-forge
cloudpickle               2.2.0              pyhd8ed1ab_0    conda-forge
curl                      7.86.0               h7bff187_1    conda-forge
cytoolz                   0.12.0           py37h540881e_0    conda-forge
dask                      2.5.2                      py_0    conda-forge
dask-core                 2.5.2                      py_0  
distributed               2.7.0                      py_0  
eigen                     3.4.0                h4bd325d_0    conda-forge
freetype                  2.12.1               hca18f0e_0    conda-forge
fsspec                    2022.11.0          pyhd8ed1ab_0    conda-forge
h5py                      2.10.0          nompi_py37h513d04c_102    conda-forge
hdf5                      1.10.5          nompi_h7c3c948_1111    conda-forge
heapdict                  1.0.1                      py_0    conda-forge
htslib                    1.9                  h4da6232_3    bioconda
importlib-metadata        4.11.4           py37h89c1867_0    conda-forge
jinja2                    3.1.2              pyhd8ed1ab_1    conda-forge
jpeg                      9e                   h166bdaf_2    conda-forge
k8                        0.2.5                hd03093a_2    bioconda
keyutils                  1.6.1                h166bdaf_0    conda-forge
krb5                      1.19.3               h3790be6_0    conda-forge
lcms2                     2.14                 h6ed2654_0    conda-forge
ld_impl_linux-64          2.39                 hcc3a1bd_1    conda-forge
lerc                      4.0.0                h27087fc_0    conda-forge
libcurl                   7.86.0               h7bff187_1    conda-forge
libdeflate                1.13                 h166bdaf_0    conda-forge
libedit                   3.1.20191231         h46ee950_2    conda-forge
libev                     4.33                 h516909a_1    conda-forge
libffi                    3.2.1             he1b5a44_1007    conda-forge
libgcc-ng                 12.2.0              h65d4601_19    conda-forge
libgfortran-ng            7.5.0               h14aa051_20    conda-forge
libgfortran4              7.5.0               h14aa051_20    conda-forge
libgomp                   12.2.0              h65d4601_19    conda-forge
libnghttp2                1.47.0               hdcd2b5c_1    conda-forge
libnsl                    2.0.0                h7f98852_0    conda-forge
libpng                    1.6.38               h753d276_0    conda-forge
libsqlite                 3.40.0               h753d276_0    conda-forge
libssh2                   1.10.0               haa6b8db_3    conda-forge
libstdcxx-ng              12.2.0              h46fd767_19    conda-forge
libtiff                   4.4.0                h0e0dad5_3    conda-forge
libwebp-base              1.2.4                h166bdaf_0    conda-forge
libxcb                    1.13              h7f98852_1004    conda-forge
libzlib                   1.2.13               h166bdaf_4    conda-forge
llvm-openmp               8.0.1                hc9558a2_0    conda-forge
locket                    1.0.0              pyhd8ed1ab_0    conda-forge
markupsafe                2.1.1            py37h540881e_1    conda-forge
minimap2                  2.24                 h7132678_1    bioconda
msgpack-python            1.0.4            py37h7cecad7_0    conda-forge
nanopolish                0.12.4               h7185e64_0    bioconda
ncurses                   6.1               hf484d3e_1002    conda-forge
numpy                     1.16.2          py37_blas_openblash1522bff_0  [blas_openblas]  conda-forge
openblas                  0.3.3             h9ac9557_1001    conda-forge
openjdk                   8.0.332              h166bdaf_0    conda-forge
openjpeg                  2.5.0                h7d73246_1    conda-forge
openmp                    8.0.1                         0    conda-forge
openssl                   1.1.1s               h166bdaf_0    conda-forge
packaging                 21.3               pyhd8ed1ab_0    conda-forge
pandas                    0.23.4           py37h04863e7_0  
partd                     1.3.0              pyhd8ed1ab_0    conda-forge
perl                      5.32.1          2_h7f98852_perl5    conda-forge
pillow                    9.2.0            py37h850a105_2    conda-forge
pip                       22.3.1             pyhd8ed1ab_0    conda-forge
psutil                    5.9.3            py37h540881e_0    conda-forge
pthread-stubs             0.4               h36c2ea0_1001    conda-forge
pyparsing                 3.0.9              pyhd8ed1ab_0    conda-forge
pysam                     0.19.1           py37hee149a5_1    bioconda
python                    3.7.8           h6f2ec95_1_cpython    conda-forge
python-dateutil           2.8.2              pyhd8ed1ab_0    conda-forge
python_abi                3.7                     2_cp37m    conda-forge
pytz                      2022.6             pyhd8ed1ab_0    conda-forge
pyyaml                    6.0              py37h540881e_4    conda-forge
readline                  8.0                  h46ee950_1    conda-forge
samtools                  0.1.19               hf89b575_7    bioconda
scikit-learn              0.20.2          py37_blas_openblashebff5e3_1400  [blas_openblas]  conda-forge
scipy                     1.2.1           py37_blas_openblash1522bff_0  [blas_openblas]  conda-forge
setuptools                65.5.1             pyhd8ed1ab_0    conda-forge
six                       1.16.0             pyh6c4a22f_0    conda-forge
sortedcontainers          2.4.0              pyhd8ed1ab_0    conda-forge
sqlite                    3.32.3               hcee41ef_1    conda-forge
tblib                     1.7.0              pyhd8ed1ab_0    conda-forge
tk                        8.6.12               h27826a3_0    conda-forge
toolz                     0.12.0             pyhd8ed1ab_0    conda-forge
tornado                   6.2              py37h540881e_0    conda-forge
typing_extensions         4.4.0              pyha770c72_0    conda-forge
wheel                     0.38.4             pyhd8ed1ab_0    conda-forge
xorg-libxau               1.0.9                h7f98852_0    conda-forge
xorg-libxdmcp             1.1.3                h7f98852_0    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge
yaml                      0.2.5                h7f98852_2    conda-forge
zict                      2.2.0              pyhd8ed1ab_0    conda-forge
zipp                      3.10.0             pyhd8ed1ab_0    conda-forge
zlib                      1.2.13               h166bdaf_4    conda-forge
zstd                      1.5.2                h6239696_4    conda-forge

I tried to follow the package requirements as closely as I could, since EpiNano doesn't work with some of the newer packages (like scikit-learn). Also, I renamed "ref.dict" to "ref.fa.dict" in the test data folder, since it was asking for it formatted that way.

Thanks again for all your help! I'll keep troubleshooting while I wait for your response.

Huanle commented 1 year ago

Hi @Whatsacb ,

Can you try running

samtools view -h -F 3860 your.bam | java -jar  /path/to/sam2tsv.jar -r your.reference.fasta

and tell me what you get?

Whatsacb commented 1 year ago

Hi Huanle,

Thanks for helping out. I found out that the sequences I was trying to parse through EpiNano actually weren't in the original .bam files (i.e., there were no mapped reads), so that likely was the culprit for my original problem. If I ran EpiNano on .bam files with verified reads that I was able to open in IGV, EpiNano ran through all the way.

However, I still am having the same issue running the test data wt.bam with the included ref.fa in ~/EpiNano/test_data/make_predictions/ as discussed in my previous post. If I run samtools view -h -F 3860 wt.bam | java -jar /path/to/sam2tsv.jar -r ref.fa on the test files, I get the following:

.
.
.
9ef4eac8-95ad-4e10-b873-4b61d64632a8    0   25s 124 T   #   3452    .   S
9ef4eac8-95ad-4e10-b873-4b61d64632a8    0   25s 125 A   $   3453    .   S
9ef4eac8-95ad-4e10-b873-4b61d64632a8    0   25s 126 T   $   3454    .   S
9ef4eac8-95ad-4e10-b873-4b61d64632a8    0   25s 127 A   #   3455    .   S
9ef4eac8-95ad-4e10-b873-4b61d64632a8    0   25s 128 T   %   3456    .   S
9ef4eac8-95ad-4e10-b873-4b61d64632a8    0   25s 129 T   )   3457    .   S
9ef4eac8-95ad-4e10-b873-4b61d64632a8    0   25s 130 A   %   3458    .   S
9ef4eac8-95ad-4e10-b873-4b61d64632a8    0   25s 131 T   %   3459    .   S
9ef4eac8-95ad-4e10-b873-4b61d64632a8    0   25s 132 C   (   3460    .   S
9ef4eac8-95ad-4e10-b873-4b61d64632a8    0   25s 133 C   #   3461    .   S
9ef4eac8-95ad-4e10-b873-4b61d64632a8    0   25s 134 A   "   3462    .   S
9ef4eac8-95ad-4e10-b873-4b61d64632a8    0   25s 135 T   #   3463    .   S
9ef4eac8-95ad-4e10-b873-4b61d64632a8    0   25s 136 C   #   3464    .   S
9ef4eac8-95ad-4e10-b873-4b61d64632a8    0   25s 137 A   %   3465    .   S
9ef4eac8-95ad-4e10-b873-4b61d64632a8    0   25s 138 T   %   3466    .   S
9ef4eac8-95ad-4e10-b873-4b61d64632a8    0   25s 139 C   '   3467    .   S
9ef4eac8-95ad-4e10-b873-4b61d64632a8    0   25s 140 C   *   3468    .   S
9ef4eac8-95ad-4e10-b873-4b61d64632a8    0   25s 141 T   #   3469    .   S
9ef4eac8-95ad-4e10-b873-4b61d64632a8    0   25s 142 C   #   3470    .   S
9ef4eac8-95ad-4e10-b873-4b61d64632a8    0   25s 143 A   #   3471    .   S
9ef4eac8-95ad-4e10-b873-4b61d64632a8    0   25s 144 C   &   3472    .   S
9ef4eac8-95ad-4e10-b873-4b61d64632a8    0   25s 145 C   #   3473    .   S
9ef4eac8-95ad-4e10-b873-4b61d64632a8    0   25s 146 T   "   3474    .   S
9ef4eac8-95ad-4e10-b873-4b61d64632a8    0   25s 147 G   $   3475    .   S
9ef4eac8-95ad-4e10-b873-4b61d64632a8    0   25s 148 A   #   3476    .   S
9ef4eac8-95ad-4e10-b873-4b61d64632a8    0   25s 149 T   $   3477    .   S
[INFO][Sam2Tsv]. Completed. N=13,759. That took:2 minutes

Thanks for all your help!

enovoa commented 1 year ago

Hi @Whatsacb, Sorry for slow reply - i see your first issue was solved but the second one wasn't - was this sorted out? Thanks, Eva

Whatsacb commented 1 year ago

Thanks for the reply. I can't remember if I got the test data to work or not, but I did get EpiNano to work, so it was all good. I can't remember exactly what the issue was, but if someone stumbles across this in the future I think I just remade my conda environment and followed the package versions more closely. Hope it helps!