Reading step 2 output - Githubissues

bryanmkevan commented 4 years ago

What do I do with this list of sites in the csv file after Step 2? Is there a way to map these Kmers to a genome?

Huanle commented 4 years ago

Hi @bryanmkevan, 1) You can already predict modifications with these csv files and you can further manipulate them for other purposes as you wish. There are some examples in https://github.com/enovoa/EpiNano/blob/master/test_data/make_predictions/

2) The positions/windows (1-based) in these files are positions in the reference sequences, genomic or transcriptomic, depending on your type of reference you used to align reads and generate bam file(s).

3) if (you are interested in training your own models and you know which sites are (un-) modified in your samples), you can follow instruction commands that can be found in https://github.com/enovoa/EpiNano/blob/master/test_data/train_models/train_test.sh

Best Regards, Huanle

bryanmkevan commented 4 years ago

I made a mistake, but I also meant after Step 3. I didn't select the per-read option for Epinano_Variants.py, so I didn't get the per-read file, and was trying to map out a strategy for backing out the 5-mers to reads.

bryanmkevan commented 4 years ago

There's also a bug in the logic of the temp directory. I believe this line here, around 260 in Epinano_Variants.py needs some work:

var_files = df_proc (tmp_dir, prefix)

I think there is a problem that if the temporary directory doesn't already have files in it at the start of the program, it will not set one of these prefixes correctly and the code will not run. I get an error:

Traceback (most recent call last): File "Epinano_Variants.py", line 271, in <module> slide_per_site_var(var_files[0]) File "epinano/EpiNano/epinano_modules.py", line 789, in slide_per_site_var fh = open (per_site_var, 'rb' ) FileNotFoundError: [Errno 2] No such file or directory: 'alignment/WT.minimap2.sorted.plus_strand.per.site.var.csv'

This is because the The files are small_0.freq, etc. I ran the code once, it didn't work but it generated the temp files. Next time the code got through this got through the logic gate because the temp directory was already full of files.

Huanle commented 4 years ago

I made a mistake, but I also meant after Step 3. I didn't select the per-read option for Epinano_Variants.py, so I didn't get the per-read file, and was trying to map out a strategy for backing out the 5-mers to reads.

You'd like to map per read kmer and reference kmer? Am i following you? If so, the per read option should be able to give you that. For the moment, we are still trying to improve per read analysis. So far, the examples in STEP3 are on per site basis.

Huanle commented 4 years ago

There's also a bug in the logic of the temp directory. I believe this line here, around 260 in Epinano_Variants.py needs some work:

var_files = df_proc (tmp_dir, prefix)

I think there is a problem that if the temporary directory doesn't already have files in it at the start of the program, it will not set one of these prefixes correctly and the code will not run. I get an error:

Traceback (most recent call last): File "Epinano_Variants.py", line 271, in <module> slide_per_site_var(var_files[0]) File "epinano/EpiNano/epinano_modules.py", line 789, in slide_per_site_var fh = open (per_site_var, 'rb' ) FileNotFoundError: [Errno 2] No such file or directory: 'alignment/WT.minimap2.sorted.plus_strand.per.site.var.csv'

This is because the The files are small_0.freq, etc. I ran the code once, it didn't work but it generated the temp files. Next time the code got through this got through the logic gate because the temp directory was already full of files.

thanks for pointing this out. I will have a look at it.

bryanmkevan commented 4 years ago

Yes, I am still having issues with the per-read code. This error in particular:

totally processed 281007 reads totally processed 0 reads Process Process-65: Traceback (most recent call last): File "epinano/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap self.run() File "epinano/lib/python3.6/multiprocessing/process.py", line 93, in run self._target(*self._args, **self._kwargs) File "epinano/EpiNano/epinano_modules.py", line 1223, in slide_per_read_var_multiprocessing firstLine = read_var_lines[0] IndexError: list index out of range

I think this has to do with an empty interative Manager().Queue() object in epinano_modules.py. The program gets through the (~line 292 EpiNano_Variants.py)

Process (target = split_tsv_for_per_read_var, args = (tsv_gen, q, args.threads))

command just fine, as seen by the first read totally processed 281007 reads, but doesn't get through the command (~line 308 EpiNano_Variants.py)

Process (target = split_reads_for_per_read_var_sliding , args = (per_read_var,q,number_threads))

a few lines later as indicated by totally processed 0 reads. The command uses q.put which should shake out to something being passed to

Process (target = slide_per_read_var_multiprocessing, args = (q, output))

and through that to line 1223 in epinano_modules.py but that q argument isn't getting filled somehow.

Huanle commented 3 years ago

Hi @bryanmkevan ,

Can you offer me with your data (bam file and reference fasta) so that i can have a test? I tried with my own data from small synthetic dataset to large eukaryotic dataset. I was not able to run into the error you came across. Thanks in advance.

bryanmkevan commented 3 years ago

Hi Huanle, I hope this finds you well. I'm just following up on this a few months later. How is the development of the per-read version of EpiNano going? Is it close to being complete?

Huanle commented 3 years ago

Hi @bryanmkevan ,

Thanks for asking. I did not invest very much time into the per-read mode but my colleague will soon release a tool that can predict modifications at the per-read level better than Epinano per-read mode. That's why i deprecated the per-read mode of Epinano.

That said, if you are keen on per-read level sequencing errors, I can spend some time helping you make it run on your system. Cheers - Huanle

bryanmkevan commented 3 years ago

Hi Huanle,

You had some early version of the per-read code that was on the github previously that I wish I had saved. Could I take a peak at that code again? I don't quite remember where it was. This is more of a cherry-on-top-of-the-cake thing for us, just interested in getting it to run once.

Best, Bryan

On Thu, Jan 14, 2021 at 4:29 AM WHUANLEE notifications@github.com wrote:

Hi @bryanmkevan https://github.com/bryanmkevan ,

Thanks for asking. I did not invest very much time into the per-read mode but my colleague will soon release a tool that can predict modifications at the per-read level better than Epinano per-read mode. That's why i deprecated the per-read mode of Epinano.

That said, if you are keen on per-read level sequencing errors, I can spend some time helping you make it run on your system. Cheers - Huanle

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/enovoa/EpiNano/issues/56#issuecomment-760166158, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALEQD7MPR7ZXQ2AICGT7ZWLSZ3PSXANCNFSM4RCSQFBQ .

--

Bryan Kevan 541-515-3776

Central Asia https://bryanmkevan.bike/2018/08/09/central-asia-epilogue/ and Framebuilding https://bryanmkevan.bike/2020/01/30/bmk-frameworks/

Huanle commented 3 years ago

Hi @bryanmkevan , the per-read function is available for both 1.0 and 1.1. You can download them from releases. But I remember the one reported to have bug is from v1.1. Cheers, Huanle

bryanmkevan commented 3 years ago

Hi Huanle,

I actually sat down and pushed through this today and got it to work. I pulled the version of PER_READ.py from 1.1 as you suggested and changed the indexing of the columns to reflect the .tsv file from 1.2.

Thanks for your hard work, the output looks great!

Huanle commented 3 years ago

glad to know that. let me know if you need further help.

novoalab / EpiNano

Reading step 2 output #56