tsailabSJ / circleseq

GNU Affero General Public License v3.0
22 stars 19 forks source link

Misc pipeline updates #31

Closed martinaryee closed 6 years ago

martinaryee commented 6 years ago

Comments from Jose:

I have been testing and making small corrections to the latest version of circleseq that we have in the Martin's fork and I think that is ready to be merged with Shengdar's fork. Some comments about the main changes.

  1. We included a parameter so the user can decide to use the 'standard' chromosomes or all the chromosomes in the reference. It is a logical parameter called "all_chromosomes". If "True", then will search for off-targets in all the chromosomes in the input reference, if "False" then it will look only at the standard chromosomes. The default parameter is 'False". I used string manipulations to defined the set of standard chromosomes, so after the merged we may need to define them in the way that Shengdar did. Now this is processed inside the function "findClevageSites.compare", new lines 525:528.

Another parameters included are: "search_radius" (with default 20) so the user can determined the size of the 'window-sequence', intended for Cpf1 but not tested yet; "merged_analysis" (with default 'True'); and "variant_analysis" (with default False).

  1. Some small changes are the removal of not used modules (as nwalign), non-used parameters (read.threshold) and non-used lines (such as those including p-values that are not being used in the sites filtering).

  2. We noticed an error in how reads are being filtered in the function "findClevageSites.tabulate_start_position", old lines 125:127 and 132:134, where the conditional is written as 'if A and B or C' instead of if A and (B or C)'. This allowed for not adequate reads to be included (more about this later).

  3. The main change is the new 'local' alignment. I recall that now we can return up to two off-target sequences, 'substitutions.only' and 'bulge'. As a consequence we added quite a few columns in the 'matched' tables, so we now have such tables with header. The columns were placed such that the columns are about the 'observed' cleavage window, then about the substitutions.only off-target site, the gaps.allowed off-target site, and general information.

  4. We adjusted the visualization code for the new order of the columns being used. We also made small changes in the look of the visualizations, e.g. adding the word PAM on top of the target sequence.

  5. The code for the variants is in the old file created by Shengdar long time ago for this purpose, "callVariants.pu", and the idea is quite simple: a) use command line to run samtools:pipeline, which creates temporally files that collect the output of each command line and the command lines themselves. In this step we look for all the variants in the windowe sequences for each site, not only at the off-target sequences. If this step runs without errors then these files will be removed and only keep a text file with the all the variants called by samtools and their info (see the attached tables).

Then we replace in each window sequence the 'reference' by the 'variants', independently of the quality score or depth of the variant, and find once more for an off-target sequence in the variant window sequence. If there is a change with respect to the first off-target sequence (difference off-target sequence, different genomic location or none off-target sequence) then we report the new sequence as in the matched table together with information about the variant(s).

We may have that in repetitive regions a variant may cause to get a new off-target sequence without variants, in that case we are intending to report the new sequence but there will be no variants information. On the other hand, if the variants increased the number of mismatches so that the sequence is not longer reported then we will report only the variants information. If both variant off-target sequences have variants then we will collect the variants information by the genomic location and separated by ':'. The variant info that we are collecting is 'reference', 'variant', 'genotype', quality'. We are reporting all the possible variants, so the user will have to double check them (maybe in IGV) and make the final call on what are real variants (as we did in the paper).