seb-mueller / chlamy_locus_map

Small RNA Locus Map for Chlamydomonas reinhardtii
GNU General Public License v3.0
1 stars 0 forks source link

Phasing tool to use #10

Closed nmatthews323 closed 5 years ago

nmatthews323 commented 6 years ago

Need to decide alternative phasing tool. This paper from 2017 compares available tools: https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbx136/4558649

PhaseTank looks like a good candidate: "PhaseTank [53] implements a slightly different methodology. After defining phased clusters that contain at least fourphased sRNA in 84 nt regions, a nonstatistical phased score is computed to express the chance of a region to be a producer of phased siRNA. This score depends on patterns of sRNA distribution and abundance in the region. The triggering sRNA is then determined following sequence complementarity principles along with the fact that the cleavage site must occur at positions 9–11 nt of the sRNA from its 50 terminal."

I'll look at downloading and creating a script for a test run of PhaseTank.

nmatthews323 commented 5 years ago

So, for using PhaseTank, we need a script a bit like this: perl PhaseTank_v1.0_mod.pl --genome /projects/nick_matthews/resources/Creinhardtii_236.fa --lib /projects/nick_matthews/sequencing_data/SL16/SL16.v_Chlamydomonas_reinhardtii_transcript-v5.3.patman.redundant_aligned_reads.fasta.gz,/projects/nick_matthews/sequencing_data/SL17/SL17.v_Chlamydomonas_reinhardtii_transcript-v5.3.patman.redundant_aligned_reads.fasta.gz,/projects/nick_matthews/sequencing_data/SL60/SL60.v_Chlamydomonas_reinhardtii_transcript-v5.3.patman.redundant_aligned_reads.fasta.gz,/projects/nick_matthews/sequencing_data/SL74/SL74.v_Chlamydomonas_reinhardtii_transcript-v5.3.patman.redundant_aligned_reads.fasta.gz,/projects/nick_matthews/sequencing_data/SL75/SL75.v_Chlamydomonas_reinhardtii_transcript-v5.3.patman.redundant_aligned_reads.fasta.gz,/projects/nick_matthews/sequencing_data/SL76/SL76.v_Chlamydomonas_reinhardtii_transcript-v5.3.patman.redundant_aligned_reads.fasta.gz The paths to library files need to be seperated by commas, and need to be unzipped before they are passed to PhaseTank (not yet implemented). PhaseTank_v1.0_mod.pl is their script with one modification so it can extract counts from our libraries.

The filenames I think we need to use are added to "Summary_of_data.csv", with the column heading "PhasingFiles", to start with I think we should just start with the wt controls as identified by Adrian ($controls=="wt").

seb-mueller commented 5 years ago

I've pulled your code into the repo and it seems to work indeed, at least trying it with one library as a test:

base=/projects/nick_matthews
libs=${base}/sequencing_data
perl PhaseTank_v1.0_mod.pl --genome ${base}/resources/Creinhardtii_236.fa --lib <(zcat ${libs}/SL16/SL16.v_Chlamydomonas_reinhardtii_transcript-v5.3.patman.redundant_aligned_reads.fasta.gz)
seb-mueller commented 5 years ago

Turns out it didn't work.. I'll unzip the files and we go with your version! Update: I've uncompressed all files into /projects/nick_matthews/sequencing_data/fasta_all.

nmatthews323 commented 5 years ago

OK, thanks Seb. Now I guess we try running with all the WTs on the cluster. Do you want to submit this or shall I? I don't know a way of doing without just assembling a very long bash command..

seb-mueller commented 5 years ago

I was about to run it and created a scaffold file already, but wanted to double check some. run_PhaseTank.r

https://github.com/seb-mueller/chlamy_locus_map/blob/f3406feca23b87b8e251e1311714f94888bcfc1b/Scripts/run_PhaseTank.r#L22

Before I run it I wanted to double check we are using the right reference. According to your initial message, you used --genome Creinhardtii_236.fa i.e. the genome as ref. How do you relate this back to the loci? Which output are you using for processing? Pred_tab_*? Or should we use the loci themselfs as reference, i.e. create a fasta file with an entry for each loci?

nmatthews323 commented 5 years ago

At the moment PhaseTank just looks for phased locations within the genome, irrespective of our locus map, I then compute overlaps between the phased regions identified by PhaseTank and our locus map:

https://github.com/seb-mueller/chlamy_locus_map/blob/f3406feca23b87b8e251e1311714f94888bcfc1b/Scripts/chlamy_source_code.R#L788

And yes I'm using the Preftab file for processing:

https://github.com/seb-mueller/chlamy_locus_map/blob/f3406feca23b87b8e251e1311714f94888bcfc1b/Scripts/chlamy_source_code.R#L773

We could certainly use a fasta file of our locus map as a reference. It would change the way we relate the loci. We need to make sure we have a clear locus naming convention so we can relate them to each other after phasing has.

To be honest, I think both approaches work. Using the fasta of the loci is a bit more rigorous, but I would imagine would result in the same (or very similar) output.

seb-mueller commented 5 years ago

Nice work! I suppose that should do the trick. I'll push on then and let you know when it's done. To double check we can always check a few of loci on IGV later.

seb-mueller commented 5 years ago

I've came across a whole different array of problems with insconsitent/empty fasta files. Upon solving it I know can't run PhaseTank anymore, it's driving me crazy and don't know why that is. I've created a test fasta file tmp.fa. Do you have any idea why this doesn't work?

sm934@node13.plantsci.internal.cam.ac.uk/projects/nick_matthews/chlamy_locus_map_github/PhaseTank_v1.0‹master*› ☺➤perl PhaseTank_v1.0_mod.pl --genome /projects/nick_matthews/resources/Creinhardtii_236.fa --lib tmp.fa
############################## PhaseTank version1.0 ##############################

Run time: 2018.11.27_16:32:25

#Input files:

        Rerference data: /projects/nick_matthews/resources/Creinhardtii_236.fa
        Read file:
                (1): tmp.fa
##Begin to check files at 2018.11.27_16:32:25..

        Checking /projects/nick_matthews/resources/Creinhardtii_236.fa...
        Checking tmp.fa...
##Finshed checking files

#Begin to normalize library at 2018.11.27_16:32:26..

#Begin to merge libraries at 2018.11.27_16:32:26..

        Use of uninitialized value in addition (+) at PhaseTank_v1.0_mod.pl line 554, <LIB> line 2.
lines:1         abun:0
Use of uninitialized value $tmp_hash{"GAAGGAGGACGGTCACGCCCAA"} in division (/) at PhaseTank_v1.0_mod.pl line 566, <LIB> line 2.
Illegal division by zero at PhaseTank_v1.0_mod.pl line 566, <LIB> line 2.
nmatthews323 commented 5 years ago

Will take a look!

nmatthews323 commented 5 years ago

Updating on this:

Phasetank needs the count information for each read, in our libraries this comes after "count:" so I just needed to tweak the main script so it correctly splits the first line of each fasta read to extract the counts information, this file is on the github "PhaseTank_v1.0_mod.pl".

I therefore adopted the script to reflect this so instead of splitting by "_x": https://github.com/seb-mueller/chlamy_locus_map/blob/451e3c4fd7e647b240c52986bfe8086632898b09/PhaseTank_v1.0/PhaseTank_v1.0.pl#L553

It now splits by "count:": https://github.com/seb-mueller/chlamy_locus_map/blob/451e3c4fd7e647b240c52986bfe8086632898b09/PhaseTank_v1.0/PhaseTank_v1.0_mod.pl#L553

This action happens right at the start of the process - it merges libraries before building the bowtie index. NOTE: when it rebuilds the new merged libraries it uses the original "_x" naming convention is used: https://github.com/seb-mueller/chlamy_locus_map/blob/451e3c4fd7e647b240c52986bfe8086632898b09/PhaseTank_v1.0/PhaseTank_v1.0_mod.pl#L590 This means that there is no need to proliferate the "count:" edit throughout the script as the fasta files we pass it are converted over to their format when the libraries are merged.

I couldn't find the bowtie typo, could you link it?

seb-mueller commented 5 years ago

Really impressed you managed to read and edit perl code! I couldn't find a decription that PhaseTank needs count anywhere, not even in there example: https://github.com/seb-mueller/chlamy_locus_map/blob/451e3c4fd7e647b240c52986bfe8086632898b09/PhaseTank_v1.0/README#L76

Anyway, there is no script to convert fasta into the count format directly. But I found a resonable search string:

ls **/*assembly5_Chlamydomonas_reinhardtii.patman.aligned_reads.fasta.gz
SL139/SL139.non_redundant.v_genome_JGI_assembly5_Chlamydomonas_reinhardtii.patman.aligned_reads.fasta.gz
SL140/SL140.non_redundant.v_genome_JGI_assembly5_Chlamydomonas_reinhardtii.patman.aligned_reads.fasta.gz
SL141/SL141.non_redundant.v_genome_JGI_assembly5_Chlamydomonas_reinhardtii.patman.aligned_reads.fasta.gz
SL142/SL142.non_redundant.v_genome_JGI_assembly5_Chlamydomonas_reinhardtii.patman.aligned_reads.fasta.gz
SL143/SL143.non_redundant.v_genome_JGI_assembly5_Chlamydomonas_reinhardtii.patman.aligned_reads.fasta.gz
SL144/SL144.non_redundant.v_genome_JGI_assembly5_Chlamydomonas_reinhardtii.patman.aligned_reads.fasta.gz
SL145/SL145.non_redundant.v_genome_JGI_assembly5_Chlamydomonas_reinhardtii.patman.aligned_reads.fasta.gz
SL146/SL146.non_redundant.v_genome_JGI_assembly5_Chlamydomonas_reinhardtii.patman.aligned_reads.fasta.gz
SL14/SL14.non_redundant.v_genome_JGI_assembly5_Chlamydomonas_reinhardtii.patman.aligned_reads.fasta.gz
SL15/SL15.non_redundant.v_genome_JGI_assembly5_Chlamydomonas_reinhardtii.patman.aligned_reads.fasta.gz
nmatthews323 commented 5 years ago

Works running this script:

perl PhaseTank_v1.0_mod.pl --genome /projects/nick_matthews/resources/Creinhardtii_236.fa --lib /projects/nick_matthews/resources/SL140.non_redundant.v_genome_JGI_assembly5_Chlamydomonas_reinhardtii.patman.aligned_reads.fasta

seb-mueller commented 5 years ago

It worked in the end! Results and details are in https://github.com/seb-mueller/chlamy_locus_map/tree/master/Scripts/PhaseTank_OUTPUT_2018.11.27_18.08

Script used is: https://github.com/seb-mueller/chlamy_locus_map/blob/master/Scripts/run_PhaseTank.r

It found 142 phasing cluster, this is the last one: https://github.com/seb-mueller/chlamy_locus_map/blob/9d55776787071653c61eaaabf324509f266268e0/Scripts/PhaseTank_OUTPUT_2018.11.27_18.08/Pred_tab_2018.11.27_18.08#L143

Seems ok to me, could you have another look?

nmatthews323 commented 5 years ago

Excellent! Not a huge number of phased locations, but I guess as you said we don't even know if this is significant.

I'll re-run the annotation pipeline as soon as possible including this info, then we can discuss if we want to modify any of the settings based on the density plots.