isovic / racon

Ultrafast consensus module for raw de novo genome assembly of long uncorrected reads. http://genome.cshlp.org/content/early/2017/01/18/gr.214270.116 Note: This was the original repository which will no longer be officially maintained. Please use the new official repository here:
https://github.com/lbcb-sci/racon
MIT License

racon removes contig ends with no coverage #126

Open johnomics opened 5 years ago

johnomics commented 5 years ago

Hello,

I'm using racon to polish a close-to-complete genome assembly where most contigs have telomeres on the end. Racon is removing many of these telomeres. Is there any way I can avoid racon removing these sequences? I am already using the -u option.

The table below shows the number of telomere sequences at the end of each contig in the raw (canu) assembly and then each racon iteration. The Lost column shows which iteration of Racon lost most or all of the telomere. (The starts of the contigs look similar.)

Contig          Raw     Racon1  Racon2  Racon3  Racon4  Lost
tig00000037     160     159     163     130     144
tig00000055     128     91      55      0       0       3
tig00000058     56      79      18      0       0       3
tig00000060     142     146     147     133     77
tig00000063     158     29      0       0       0       2
tig00000070     3       2       2       2       1
tig00000082     143     160     151     129     17
tig00000084     61      25      1       0       0       2
tig00000104     154     164     165     165     166
tig00000134     136     148     155     145     76
tig00000158     143     138     86      31      0       4
tig00000182     114     98      85      42      41
tig00000197     124     134     142     144     137
tig00000209     143     152     146     151     149
tig00000218     110     109     44      23      0       4
tig00000238     146     86      88      92      92
tig00003593     49      12      0       0       0       2
tig00003595     142     158     152     130     106
tig00003601     142     149     140     139     140
tig00003605     136     146     116     29      4       4
tig00003607     66      21      0       0       0       2
tig00003608     0       0       0       0       0
tig00306617     138     131     51      7       0       3
tig00306621     154     166     166     165     147
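Counts like those in the table could be produced with something along these lines. This is only a sketch: the thread does not say how the counts were made, so the exact-match counting, the 1000 bp window, and the function names are all assumptions; the TTAGGG motif is taken from the sequences shown below.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq.upper()))


def count_end_repeats(seq, motif="TTAGGG", window=1000):
    """Count non-overlapping exact copies of `motif` in the last `window` bases."""
    return seq[-window:].upper().count(motif)


def count_start_repeats(seq, motif="TTAGGG", window=1000):
    """Count copies of the reverse-complemented motif in the first `window` bases."""
    return seq[:window].upper().count(reverse_complement(motif))


# Hypothetical toy contig: 5 repeats at the end, 3 reverse-complement repeats at the start.
contig = "CCCTAA" * 3 + "ACGCAC" * 10 + "TTAGGG" * 5
print(count_start_repeats(contig), count_end_repeats(contig))
```

Exact matching undercounts degenerate copies (for example the TTGGGG unit in the raw sequence below), so real counts may need a more tolerant matcher.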

The sequence is definitely being removed from the contig, it is not being polished to some other sequence. For example, here is a rough alignment of each version of the end of tig00000084:

Raw    ...CGACTCACAAGAAAGATATCCCGATGCGAATAAAGTGTGTGTGTCTTGAAAACGTGTGACCATACGCTATCTTCCCAGTCTCCGTAGCCCTCCGTCATTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTGGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTTTAGGAGTTAGGGTTAGG
Racon1 ...CGACTCACAAGAAAGATATCCCGATGCGAATAAAGTGTGTGTGTCTTGAAAACGTGTGACCATACGCTATCTTCCCAGTCTCCGTAGCCCTCCGTCATTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTATTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTTTAGGAG
Racon2 ...CGACTCACAAGAAAGATATCCCGATGCGAATAAAGTGTGTGTGTCTTGAAAACGTGTGACCATACGCTATCTTCCCAGTCTCCGTAGCCCTCCGTCATTAGGGTT
Racon3 ...CGACTCACAAGAAAGATATCCCGATGCGAATAAAGTGTGTGTGTCTTGAAAACGTGTGACCATACGCTATCTTCCCAGTCTCCGTAGCCCTCCGTCATT
Racon4 ...CGACTCACAAGAAAGATATCCCGATGCGAATAAAGTGTGTGTGTCTTGAAAACGTGTGACCATACGCTATCTTCCCAGTCTCCGTAGCCCTCCGTCATT

The PAF files show there are no alignments to the telomeres in the racon1 alignments, which presumably causes the truncation, but I would prefer to retain these sequences for now if possible. Is there a way of doing this?

Many thanks John

rvaser commented 5 years ago

Hi John, you can try with option -f which will enable multiple mappings per read (vs one by default) and hopefully increase the coverage at contig ends. Not sure why it truncates the ends so much. Are you polishing with short or long reads? What coverage do you have?

Best regards, Robert

johnomics commented 5 years ago

Thanks - I'm polishing with 600x coverage of the genome in MinION 9.4.1 reads called with guppy 3.1.5 from a Circulomics library - >50x coverage in >50kb reads.

I'll try -f, but I'm a bit worried about that, as multi-alignments are often inaccurate (the genome has subtelomeres present at all chromosome ends, so a lot of reads pile up at these regions). I think I'd prefer to leave the sequence as unpolished, rather than risk polishing it incorrectly. Maybe leaving unpolished bases as lower case as per https://github.com/isovic/racon/issues/117 would be useful.

But are you saying you wouldn't expect racon to truncate ends without alignments? Would you have expected it to leave the ends intact, even if there were no reads aligning to it?

The trimming seems like defensible behaviour to me - it's quite possible these ends are incorrect - but I'm running Medaka after racon and I can see a few cases where Medaka has reintroduced the telomere. So it would be great to have an option to keep all the end sequences if possible, so I can do further polishing and checking downstream.

Thanks John

rvaser commented 5 years ago

Racon uses a heuristic postprocessing method which truncates the obtained window consensus at its ends depending on the window coverage. Let's say the average coverage of a window is 50: it will trim both ends until a base occurs with at least 25 coverage. This works pretty well, although sometimes the ends of complete sequences might suffer if they are circular. The heuristic method (and even the MSA) is not used when a window has too low coverage. So if your whole telomere has low coverage and is almost 500 bp long, it should not be truncated at all. I am not sure why this is happening here though, as you are using long reads and lots of reads should map to both ends. I can write you a hack to not employ the trimming at sequence beginning and end if you want to try.
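The trimming rule described above can be sketched as follows. This is an illustration of the description only, not Racon's actual code: the function name and the per-base coverage list are hypothetical, and the low-coverage bypass mentioned above is not modelled.

```python
def trim_window_consensus(consensus, coverage):
    """Trim both ends of a window consensus until a base is reached whose
    coverage is at least half of the window's mean coverage."""
    if not consensus:
        return consensus
    threshold = (sum(coverage) / len(coverage)) / 2

    begin = 0
    while begin < len(consensus) and coverage[begin] < threshold:
        begin += 1

    end = len(consensus)
    while end > begin and coverage[end - 1] < threshold:
        end -= 1

    return consensus[begin:end]


# Hypothetical window: eight well-covered bases followed by a low-coverage tail.
consensus = "ACGTACGTTT"
coverage = [50] * 8 + [5, 5]   # mean 41, so the threshold is 20.5
print(trim_window_consensus(consensus, coverage))
```

Here the last two bases fall below the threshold and are dropped, which is the behaviour that removes low-coverage insertions at window ends.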

johnomics commented 5 years ago

Ah! That might explain it to some extent. We have noticed that nanopore only sequences telomeres in one direction - reads are present with telomeres at their ends, but no reads have telomeres at their starts. Something prevents the nanopore from starting at the telomere. So the coverage of telomeres is already half of what it is for the rest of the chromosome.

Also, because very similar subtelomeres occur at every chromosome end, the coverage is often quite high at the contig ends - short reads (as in <2-3kb, not Illumina!) can be aligned with fairly high quality to the 'wrong' subtelomere, because there are plenty of common subsequences.

So it's possible that the average coverage of the end windows is higher due to the subtelomere alignments AND the telomere coverage is lower due to reads having only one direction, so the telomeres get trimmed because they are considerably lower than half the average coverage.
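A quick back-of-the-envelope check of this idea, using made-up numbers: a 500 bp end window made of 350 bp of subtelomere and 150 bp of telomere, with the telomere at half of a nominal 80x. The segment lengths and coverages are assumptions chosen for illustration, not measurements from this data set.

```python
def half_mean_threshold(segments):
    """segments: list of (length_bp, coverage) pairs that make up one window.
    Returns half of the length-weighted mean coverage."""
    total_bp = sum(length for length, _ in segments)
    mean_cov = sum(length * cov for length, cov in segments) / total_bp
    return mean_cov / 2


telomere_cov = 40  # assumed: one read direction only, half of 80x

# Subtelomere at the nominal 80x versus inflated to 120x by misplaced short alignments.
clean = half_mean_threshold([(350, 80), (150, telomere_cov)])
piled = half_mean_threshold([(350, 120), (150, telomere_cov)])

print(clean, telomere_cov >= clean)   # telomere stays above the threshold
print(piled, telomere_cov >= piled)   # telomere drops below the threshold
```

With these numbers, halved telomere coverage alone is not enough to trigger trimming; it only falls below the threshold once the subtelomere coverage is inflated as well.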

I'll try filtering the reads by length to avoid as much incorrect subtelomeric mapping as possible and see if that improves things. Maybe I could also fake some telomeric reads to get the coverage up as well (although they might not align well if they didn't read through the whole subtelomere, which would probably screw up the polishing for the whole subtelomere...)
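Filtering reads by length could look something like this. It is a minimal sketch that assumes plain four-line FASTQ records (no wrapped sequence lines); the function name and the cutoff are hypothetical.

```python
def filter_fastq_by_length(lines, min_len):
    """Yield 4-line FASTQ records (as tuples) whose sequence has at least
    `min_len` bases. Assumes unwrapped, four-line records."""
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            if len(record[1]) >= min_len:
                yield tuple(record)
            record = []


# Toy input: one short read and one long read.
fastq = [
    "@short\n", "ACGT\n", "+\n", "IIII\n",
    "@long\n", "ACGTACGTACGT\n", "+\n", "IIIIIIIIIIII\n",
]
kept = list(filter_fastq_by_length(fastq, min_len=10))
print([rec[0] for rec in kept])
```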

Is this truncation ONLY done for the windows at the ends of contigs? Or for every window across the genome?

If it's not too much work, and it will only affect the ends of sequences, I think an option to turn off this truncation would be useful - happy to test it out if you write it.

Thanks John

rvaser commented 5 years ago

Are the numbers in the table total telomere size of both ends or just one of them? Can you see if the beginning is more trimmed than the end?

The trimming is done for each window (of 500 bp) in each of the sequences you are polishing. I'll try to add an option for this in a couple of hours.

johnomics commented 5 years ago

The numbers are total telomere count, for a 6bp telomere - so for example the tig00000084 contig has lengths something like this:

Telomere count       61  25   1   0   0
Telomere seq length 366 150   6   0   0

Here are the numbers for the contig starts. I've added a Lost column to the above table and here, showing in which iteration most of the telomere was lost. There are 4 telomeres lost from the starts of contigs, but 10 from the ends, so it looks like there is some orientation effect here. I could try reversing the contigs and rerunning racon if that would be useful.

Contig          Raw     Racon1  Racon2  Racon3  Racon4  Lost
tig00000037     149     163     144     161     159
tig00000055     134     122     121     37      27
tig00000058     1       1       1       1       1
tig00000060     128     141     92      3       0       3
tig00000063     135     165     164     166     105
tig00000070     135     137     62      7       6       3
tig00000082     157     115     84      81      79
tig00000084     132     79      89      79      78
tig00000104     137     165     103     84      84
tig00000134     0       0       0       0       0
tig00000158     84      85      108     66      11
tig00000182     141     149     58      37      1       4
tig00000197     156     164     164     162     163
tig00000209     118     96      99      97      84
tig00000218     107     111     96      91      68
tig00000238     146     142     82      74      78
tig00003593     162     157     162     164     165
tig00003595     150     165     155     82      80
tig00003601     158     157     82      82      82
tig00003605     58      1       0       0       0
tig00003607     147     159     163     163     165
tig00003608     160     129     82      92      56
tig00306617     143     139     40      43      0       4
tig00306621     116     120     127     137     70

(Thanks for offering to look into an option, but no rush - I probably won't get to testing until next week now.)

rvaser commented 5 years ago

@johnomics, the beginning of the first window and the end of the last window are no longer trimmed on branch feature_no_trim. Check it out, compile, and run it like default racon (I have not yet added an option to enable this behaviour). Sorry for the delay!

rvaser commented 5 years ago

I have also added an option to disable trimming completely (option --no-trimming) to the same branch as above.

johnomics commented 5 years ago

Thank you very much for this. I've tested the branch with and without the --no-trimming option. I subsampled the sequence, alignments and reads for tig00000084, using the original PAF file to select tig00000084 alignments with awk, and then extracting the reads from the original read set with seqkit. The files I used are here: https://drive.google.com/open?id=13_yDnPrjy5Qi9KlrD9Gj-h4QW0Nh8zIo
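The subsampling step could be sketched like this. The original used awk and seqkit; this Python version is a stand-in, with a hypothetical function name. It relies only on the PAF layout, where column 1 is the query (read) name and column 6 is the target (contig) name.

```python
def select_paf_for_target(paf_lines, target):
    """Return (kept_lines, read_names) for PAF records aligned to `target`.
    PAF has 12 mandatory tab-separated columns; shorter lines are skipped."""
    kept, read_names = [], set()
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 12 and fields[5] == target:
            kept.append(line.rstrip("\n"))
            read_names.add(fields[0])
    return kept, read_names


# Toy PAF records (values are illustrative only).
paf = [
    "read1\t1000\t0\t900\t+\ttig00000084\t587628\t100\t1000\t850\t900\t60\n",
    "read2\t2000\t0\t1500\t-\ttig00000055\t600000\t50\t1550\t1400\t1500\t60\n",
    "read3\t1500\t10\t1400\t+\ttig00000084\t587628\t5000\t6390\t1300\t1390\t60\n",
]
kept, names = select_paf_for_target(paf, "tig00000084")
print(len(kept), sorted(names))
```

The resulting set of read names can then be used to pull the matching reads out of the full read set.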

I then ran racon 1.3.3 on these files, then racon from branch feature_no_trim, then branch feature_no_trim with --no-trimming. Here are the results (name, contig length, end sequence):

Raw           587628 ...ATATCCCGATGCGAATAAAGTGTGTGTGTCTTGAAAACGTGTGACCATACGCTATCTTCCCAGTCTCCGTAGCCCTCCGTCATTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTGGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTTTAGGAGTTAGGGTTAGG
Racon 1.3.3   592147 ...ATATCCCGATGCGAATAAAGTGTGTGTGTCTTGAAAACGTGTGACCATACGCTATCTTCCCAGTCTCCGTAGCCCTCCGTCATTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTGGGGTTAGGGTTAGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTGGGGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGAGTTAGGGTTAGG
Racon branch  592259 ...ATATCCCGATGCGAATAAAGTGTGTGTGTCTTGAAAACGTGTGACCATACGCTATCTTCCCAGTCTCCGTAGCCCTCCGTCATTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTGGGGTTAGGGTTAGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTGGGGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGAGTTAGGGTTAGG
--no-trimming 593291 ...ATATCCCGATGCGAATAAAGTGTGTGTGTCTTGAAAACGTGTGACCATACGCTATCTTCCCAGTCTCCGTAGCCCTCCGTCATTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTGGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGGGTTAGGGTTAGGGTTAGGGTTAGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAAGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTGGGGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGAGTTAGGGTTAGG

The resulting FASTAs are in the Google Drive folder above.

Without --no-trimming, the end sequence is exactly the same as before, still trimmed - although the contig itself is 112bp longer. But with --no-trimming, the telomere is actually longer than the raw sequence!

So neither of these options straightforwardly resolves the problem - but do you think the no-trimming result looks reasonable (~1kb added in total)?

Please don't worry about this too much - I'll probably just use 1.3.3 with 2 iterations, as that preserves most of the telomeres. But if you think there's something worth following up on here, I'm happy to assist with testing.

Thanks John

rvaser commented 5 years ago

The problem with POA consensus is that long insertions at the beginning and end of the POA graph are added into the consensus due to the heaviest path algorithm. That is why we trim both consensus ends until sufficient coverage is found. An example graph is below (golden nodes are part of the found consensus), in which the end is a big insertion of low coverage:

[Image: sample_poa_graph]

It is quite odd that iterations 3 or 4 decrease the length of the telomere so drastically. Can you maybe check if some of your contigs share significant overlaps with each other? The other explanation would be that the majority of small reads that cover telomeres all have their best alignment to a small portion of your contigs which results in insufficient coverage at ends of the other contigs (might be solvable with -f option).

johnomics commented 5 years ago

The contigs definitely share overlaps at the ends, because they are full chromosomes, and they all feature roughly similar subtelomere sequences ~10kb long. This definitely caused small reads to pile up at some locations and artificially increase the coverage, which is probably causing the filtering at the very ends. I've tried filtering to >20kb reads, but there are still pileups of short alignments from those reads, which cause similar telomere truncations (actually it's a bit worse than before). So I'm going to try filtering the alignments themselves and see if that improves things.

cgjosephlee commented 4 years ago

Hi @johnomics ,

I have encountered a similar issue. Did you use minimap2 to align the reads? Would you please share your commands? I recently noticed that the -a and -c arguments are crucial in minimap2 alignment and may produce different alignment results, especially at the terminal positions (https://github.com/lh3/minimap2/blob/master/FAQ.md#1-alignment-different-with-option--a-or--c). This does affect the behavior of racon: without minimap2 -c, racon seems to remove most telomeres. I would like to know if this is the problem.

Best, Joseph

johnomics commented 4 years ago

Hi @cgjosephlee - thank you very much for this, I can confirm that using minimap2 -c for racon alignments retains the telomeres!

@rvaser, please could you consider mentioning this in the README? Seems likely the telomere trimming might only be the most visible effect of using approximate mappings.

Thanks John

cgjosephlee commented 4 years ago

After a few trials, I still see racon remove telomeres (or low coverage regions) in some cases with minimap2 -a.

johnomics commented 4 years ago

Are they completely removed, or just trimmed? Is it progressive? I saw some of my telomeres altered in length, which I took to be appropriate refinement of the true telomeres, but none of them had the progressive removal that I was seeing in the examples above.

Tedious to check, but is there any difference between using minimap2 -a SAM output and minimap2 -c PAF output?

rvaser commented 4 years ago

Hi @cgjosephlee and @johnomics,

Using option -c with Minimap2 puts the CIGAR into the PAF, but Racon does not take it into account and uses Edlib to align the overlaps instead. If you use -a (SAM output), the CIGAR strings are used as such. So there is a difference between the two parameters (-c is coupled with edit distance from Edlib, -a is tied to the ksw2 alignment from Minimap2). Using the -c option with Minimap2 probably discards more false overlaps, as alignments are calculated, as opposed to plain overlaps based on k-mer indexing.
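For reference, the two ways of running one polishing round could be assembled like this. This is a hedged sketch that only builds argument lists and runs nothing; the file names, the map-ont preset, the thread count, and the helper name are assumptions, and minimap2's standard output would still need to be redirected into the overlaps file.

```python
def polishing_commands(reads, contigs, mode, threads=4):
    """Build (minimap2_argv, racon_argv) for one round.
    mode="paf": minimap2 -c (PAF with CIGAR; Racon realigns with Edlib).
    mode="sam": minimap2 -a (SAM; Racon uses the CIGAR strings as given)."""
    if mode == "paf":
        map_flag, overlaps = "-c", "overlaps.paf"
    elif mode == "sam":
        map_flag, overlaps = "-a", "overlaps.sam"
    else:
        raise ValueError("mode must be 'paf' or 'sam'")

    # minimap2 takes the target (contigs) first, then the query (reads).
    minimap2_argv = ["minimap2", "-x", "map-ont", map_flag,
                     "-t", str(threads), contigs, reads]
    # racon takes: <sequences> <overlaps> <target sequences>.
    racon_argv = ["racon", "-t", str(threads), reads, overlaps, contigs]
    return minimap2_argv, racon_argv


mm2, rc = polishing_commands("reads.fastq", "contigs.fasta", "paf")
print(" ".join(mm2), ">", rc[4])
print(" ".join(rc), "> polished.fasta")
```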

Sorry for the late response!

Best regards, Robert

cgjosephlee commented 4 years ago

@johnomics Some of them were trimmed progressively, some were refined and recognized, and some were retained across iterative runs, so the situation is more complicated. Most telomeres were identified in the first racon round (better than raw and the following rounds).

I did try converting SAM to PAF, but a long CIGAR string in PAF is problematic (#115). I will try removing the CIGAR from the PAF, since it is not used, as @rvaser said.

rvaser commented 4 years ago

@cgjosephlee, Racon version 1.4.5 should be able to handle any CIGAR length in PAF so you do not have to remove them manually.

cgjosephlee commented 4 years ago

Thank you for the information!

aineniamh commented 4 years ago

Hi @rvaser, out of interest, is the --no-trim feature likely to make it onto the master branch and into a conda recipe in the near future?

I've been working on short viral sequences and having the ends trimmed or not was making a big difference to me. Whilst having a local binary works for development, I'm hoping to include it in a pipeline that will be more widely distributed and is managed with a conda environment.

I would really like to include this software as testing has shown great results with the current pipeline setup! (as an aside: I've currently got a cherrypicked version of the master with the no-trim options as I needed more recent master commits to compile on a mac.)

Thanks! Áine

rvaser commented 4 years ago

Hi Aine, were you using the --no-trimming parameter on branch feature_no_trim or not? I am asking because I also removed trimming of the first/last window regardless of the --no-trimming parameter.

Best regards, Robert

aineniamh commented 4 years ago

I was using --no-trimming on the feature_no_trim branch.

Full command: racon/build/bin/racon --no-trimming -t 1 {input.reads} {input.paf} {input.fasta} > {output}

Previously using a conda installed version, with each iteration of racon, I was losing the ends (attached, top is original reference, each below is a consensus generated from each round of racon & minimap2 I ran). The data I'm working on is amplicon-based so read depth is pretty consistent right up to the end of the reads.

[Image: racon_iterations]

With the --no-trimming flag, this end loss no longer happens, so seems like a decent fix.

rvaser commented 4 years ago

Available in version 1.4.6 at https://github.com/lbcb-sci/racon.

cgjosephlee commented 4 years ago

--no-trimming option in 1.4.6 works perfectly! Telomeres were retained in iterative runs.

aineniamh commented 4 years ago

Excellent! Thanks for this, it's working well!

johnomics commented 4 years ago

Hi @cgjosephlee - can you retain telomeres using minimap2 -c (PAF + Edlib alignments), rather than minimap2 -a (SAM + ksw2 alignments) & --no-trimming? It's great that the latter works for the telomeres, but as @rvaser points out above, there are good reasons to trim, and it might be better to keep trimming on.

However, I'm not sure minimap2 -c is a good idea either, having taken a closer look at the alignments - while minimap2 is now finding alignments right to the end of the contig, and so racon has some information to polish with, the alignments are almost all small chunks of reads that are probably coming from other chromosome ends. So even if the telomeres are being retained, they are probably not accurate. But the issue is about telomere presence, not telomere accuracy, so I'm happy to close this issue if everyone else is!

rvaser commented 4 years ago

@johnomics & @cgjosephlee, could you also try mapping with option -f 0 so that minimap2 does not filter out repetitive k-mers? I think that should increase coverage at telomeres.

cgjosephlee commented 4 years ago

Ok! All of these trials started with a ~54 Mb fungal genome (canu assembly in 21 contigs, 125x nanopore reads).

I used trf to identify whether telomeres existed in the terminal 5 kb windows. The number of scaffolds with telomeres at both ends:

a. racon 1.3.2              , minimap2
b. racon 1.3.2              , minimap2 -a
c. racon 1.4.6 --no-trimming, minimap2 -a
d. racon 1.4.6 --no-trimming, minimap2 -c
e. racon 1.4.6              , minimap2 -c
f. racon 1.4.6              , minimap2 -c -f 0

   raw racon_1 racon_2 racon_3 racon_4 racon_5
a    6      10       0       0       0       0
b           11      10       8       8       6
c           11      11      11      11      11
d           11      11      11      11      11
e           11      11       9       9       8
f           11       9       8       6       6

rvaser commented 4 years ago

Thanks @cgjosephlee for the evaluation! I am not sure why they are completely trimmed after a few rounds. Do you perhaps know their size through the iterations?

cgjosephlee commented 4 years ago

Here is the parsed trf output of e, ordered from raw to racon_5. It seems each end is edited at a different pace, e.g. tig00000001 START is trimmed while tig00000001 END is retained.

# cols: ctg ctg_len START pos_start pos_end length repeat_size repeat_copies repeat_seq
tig00000001     6794464 START   1       146     146     6       24.5    CCTAAC
tig00000001     6794464 END     6794379 6794464 86      6       14.2    GTTAGG
tig00000011     4954065 START   1       155     155     6       25.5    CTAACC
tig00000011     4954065 END     4953914 4954065 152     6       25.5    TAGGGT
tig00000019     4291540 END     4291404 4291540 137     6       22.5    TAGGGT
tig00000037     3863498 END     3863350 3863498 149     6       24.7    AGGGTT
tig00000041     3850925 END     3850746 3850898 153     6       26.2    GGTTAG
tig00000045     3809958 START   1       150     150     6       25.2    TAACCC
tig00000062     3813806 START   1       141     141     6       24.7    CCTAAC
tig00000062     3813806 END     3813652 3813806 155     6       26.3    AGGGTT
tig00000083     2781522 START   1       160     160     6       26.7    CCTAAC
tig00000083     2781522 END     2781377 2781522 146     6       24.7    TAGGGT
tig00000095     2712376 END     2712225 2712375 151     6       26.0    GGGTTA
tig00009906     5227866 END     5227733 5227866 134     6       22.5    AGGGTT
tig00009907     3598546 START   3       144     142     6       25.7    CCTAAC
tig00009907     3598546 END     3598407 3598546 140     6       23.3    AGGGTT
tig00009908     45269   END     45118   45269   152     6       25.3    AGGGTT
tig00009910     3536766 START   1       131     131     6       21.8    CCCTAA
tig00009910     3536766 END     3536612 3536764 153     6       25.5    AGGGTT
tig00009912     2065533 END     2065380 2065532 153     6       25.5    TAGGGT
tig00009914     1155944 MID     142316  142388  73      6       12.2    GGGTTA
tig00009914     1155944 END     1155799 1155944 146     6       24.2    TAGGGT

Found in both ends : 6
Found in single end: 9
Found in interval  : 1
['tig00000001', 'tig00000011', 'tig00000062', 'tig00000083', 'tig00009907', 'tig00009910']
['tig00000019', 'tig00000037', 'tig00000041', 'tig00000045', 'tig00000095', 'tig00009906', 'tig00009908', 'tig00009912', 'tig00009914']
['tig00009914']
###
tig00000001     6807939 START   1       118     118     6       19.7    CCCTAA
tig00000001     6807939 END     6807852 6807939 88      6       14.7    AGGGTT
tig00000011     4963113 START   1       139     139     6       23.2    AACCCT
tig00000011     4963113 END     4962966 4963113 148     6       24.7    TAGGGT
tig00000019     4300092 END     4299960 4300092 133     6       22.0    GGTTAG
tig00000037     3870216 START   1       151     151     6       25.5    CTAACC
tig00000037     3870216 END     3870071 3870216 146     6       24.2    AGGGTT
tig00000041     3858223 START   1       158     158     6       26.3    TAACCC
tig00000041     3858223 END     3858064 3858223 160     6       26.7    TAGGGT
tig00000045     3817180 START   1       140     140     6       23.3    CTAACC
tig00000062     3821160 START   1       132     132     6       22.2    AACCCT
tig00000062     3821160 END     3821008 3821160 153     6       25.5    AGGGTT
tig00000083     2786265 START   1       158     158     6       26.3    TAACCC
tig00000083     2786265 END     2786117 2786265 149     6       24.8    GTTAGG
tig00000095     2717590 START   1       134     134     6       23.3    CTAACC
tig00000095     2717590 END     2717455 2717590 136     6       22.8    GGGTTA
tig00009906     5237806 START   1       122     122     6       20.2    CCTAAC
tig00009906     5237806 END     5237651 5237806 156     6       25.5    AGGGTT
tig00009907     3605257 START   1       105     105     6       17.5    CTAACC
tig00009907     3605257 END     3605118 3605257 140     6       23.3    AGGGTT
tig00009908     45509   END     45359   45509   151     6       24.8    GTTAGG
tig00009909     30600   START   1       145     145     6       25.2    AACCCT
tig00009910     3544915 START   1       130     130     6       21.7    CCTAAC
tig00009910     3544915 END     3544768 3544915 148     6       24.7    AGGGTT
tig00009911     37029   START   19      152     134     6       23.7    CCTAAC
tig00009912     2069026 START   1       128     128     6       21.7    CCTAAC
tig00009912     2069026 END     2068878 2069026 149     6       24.8    TAGGGT
tig00009913     82842   START   1       142     142     6       24.2    CCTAAC
tig00009914     1158941 MID     142618  142690  73      6       12.2    GGGTTA
tig00009914     1158941 END     1158800 1158941 142     6       23.7    TAGGGT

Found in both ends : 11
Found in single end: 7
Found in interval  : 1
['tig00000001', 'tig00000011', 'tig00000037', 'tig00000041', 'tig00000062', 'tig00000083', 'tig00000095', 'tig00009906', 'tig00009907', 'tig00009910', 'tig00009912']
['tig00000019', 'tig00000045', 'tig00009908', 'tig00009909', 'tig00009911', 'tig00009913', 'tig00009914']
['tig00009914']
###
tig00000001     6808656 START   1       32      32      6       5.3     CTAACC
tig00000001     6808656 END     6808569 6808656 88      6       14.7    AGGGTT
tig00000011     4963611 START   1       136     136     6       22.7    CCTAAC
tig00000011     4963611 END     4963466 4963611 146     6       24.3    TAGGGT
tig00000019     4300687 END     4300553 4300687 135     6       22.5    TAGGGT
tig00000037     3870489 START   1       88      88      6       14.7    CCTAAC
tig00000037     3870489 END     3870340 3870489 150     6       25.0    AGGGTT
tig00000041     3858484 START   1       141     141     6       23.5    CTAACC
tig00000041     3858484 END     3858331 3858484 154     6       25.3    TAGGGT
tig00000045     3817561 START   1       146     146     6       24.3    CTAACC
tig00000062     3821571 START   1       132     132     6       22.2    AACCCT
tig00000062     3821571 END     3821422 3821571 150     6       25.0    AGGGTT
tig00000083     2786471 START   1       148     148     6       24.7    CCTAAC
tig00000083     2786471 END     2786332 2786471 140     6       23.3    TAGGGT
tig00000095     2717876 START   1       134     134     6       22.8    ACCCTA
tig00000095     2717876 END     2717738 2717876 139     6       22.8    TAGGGT
tig00009906     5238475 START   1       126     126     6       20.7    CCTAAC
tig00009906     5238475 END     5238322 5238475 154     6       25.0    AGGGTT
tig00009907     3605545 START   1       103     103     6       17.2    AACCCT
tig00009907     3605545 END     3604378 3604438 61      6       10.2    CCTAAC
tig00009907     3605545 END     3605406 3605545 140     6       23.3    AGGGTT
tig00009908     45524   END     45373   45524   152     6       25.3    AGGGTT
tig00009909     30582   START   1       137     137     6       22.7    CCTAAC
tig00009910     3545172 START   1       128     128     6       21.3    TAACCC
tig00009910     3545172 END     3545025 3545172 148     6       24.7    AGGGTT
tig00009912     2069268 START   1       131     131     6       21.7    CCTAAC
tig00009912     2069268 END     2069123 2069268 146     6       24.5    TAGGGT
tig00009913     82786   START   1       88      88      6       14.7    CCTAAC
tig00009914     1158999 MID     142609  142681  73      6       12.2    GGGTTA
tig00009914     1158999 END     1158863 1158999 137     6       22.8    GGTTAG

Found in both ends : 11
Found in single end: 6
Found in interval  : 1
['tig00000001', 'tig00000011', 'tig00000037', 'tig00000041', 'tig00000062', 'tig00000083', 'tig00000095', 'tig00009906', 'tig00009907', 'tig00009910', 'tig00009912']
['tig00000019', 'tig00000045', 'tig00009908', 'tig00009909', 'tig00009913', 'tig00009914']
['tig00009914']
###
tig00000001     6809062 END     6808976 6809062 87      6       14.5    AGGGTT
tig00000011     4963763 START   1       136     136     6       22.7    CCTAAC
tig00000011     4963763 END     4963619 4963763 145     6       24.2    TAGGGT
tig00000019     4300570 END     4300437 4300570 134     6       22.3    TAGGGT
tig00000037     3870547 END     3870396 3870547 152     6       25.3    AGGGTT
tig00000041     3858545 START   1       139     139     6       23.2    AACCCT
tig00000041     3858545 END     3858392 3858545 154     6       25.3    TAGGGT
tig00000045     3817646 START   1       141     141     6       23.5    CCTAAC
tig00000062     3821602 START   1       129     129     6       21.7    CCTAAC
tig00000062     3821602 END     3821453 3821602 150     6       25.0    AGGGTT
tig00000083     2786584 START   1       145     145     6       24.3    TAACCC
tig00000083     2786584 END     2786449 2786584 136     6       22.7    TAGGGT
tig00000095     2717798 START   1       110     110     6       18.3    CTAACC
tig00000095     2717798 END     2717655 2717798 144     6       23.3    TAGGGT
tig00009906     5238611 START   1       122     122     6       20.2    CCTAAC
tig00009906     5238611 END     5238465 5238611 147     6       24.0    AGGGTT
tig00009907     3605733 START   1       79      79      6       13.2    AACCCT
tig00009907     3605733 END     3604566 3604626 61      6       10.2    CCTAAC
tig00009907     3605733 END     3605594 3605733 140     6       23.3    AGGGTT
tig00009908     45505   END     45355   45505   151     6       25.3    AGGGTT
tig00009909     30578   START   1       134     134     6       22.3    TAACCC
tig00009910     3545099 START   1       128     128     6       21.3    TAACCC
tig00009910     3545099 END     3544953 3545099 147     6       24.5    AGGGTT
tig00009911     37025   START   1       136     136     6       23.7    CCTAAC
tig00009912     2069247 START   1       130     130     6       21.7    CCTAAC
tig00009912     2069247 END     2069103 2069247 145     6       24.3    TAGGGT
tig00009913     82779   START   1       87      87      6       14.5    CTAACC
tig00009914     1158854 MID     142437  142509  73      6       12.2    GGGTTA
tig00009914     1158854 END     1158727 1158854 128     6       21.3    TAGGGT

Found in both ends : 9
Found in single end: 9
Found in interval  : 1
['tig00000011', 'tig00000041', 'tig00000062', 'tig00000083', 'tig00000095', 'tig00009906', 'tig00009907', 'tig00009910', 'tig00009912']
['tig00000001', 'tig00000019', 'tig00000037', 'tig00000045', 'tig00009908', 'tig00009909', 'tig00009911', 'tig00009913', 'tig00009914']
['tig00009914']
###
tig00000001     6809107 END     6809021 6809107 87      6       14.5    AGGGTT
tig00000011     4963915 START   1       141     141     6       23.7    CCTAAC
tig00000011     4963915 END     4963775 4963915 141     6       23.5    TAGGGT
tig00000019     4300923 END     4300790 4300923 134     6       22.3    TAGGGT
tig00000037     3870575 END     3870424 3870575 152     6       25.0    AGGGTT
tig00000041     3858663 START   1       129     129     6       21.5    CTAACC
tig00000041     3858663 END     3858511 3858663 153     6       25.3    TAGGGT
tig00000045     3817740 START   1       146     146     6       24.3    CTAACC
tig00000062     3821657 START   1       122     122     6       20.7    CCTAAC
tig00000062     3821657 END     3821513 3821657 145     6       24.2    AGGGTT
tig00000083     2786586 START   1       145     145     6       24.3    TAACCC
tig00000083     2786586 END     2786451 2786586 136     6       22.7    TAGGGT
tig00000095     2717907 START   1       53      53      6       8.8     ACCCTA
tig00000095     2717907 END     2717798 2717907 110     6       18.3    GGGTTA
tig00009906     5238813 START   1       129     129     6       21.3    TAACCC
tig00009906     5238813 END     5238667 5238813 147     6       24.0    AGGGTT
tig00009907     3605814 START   1       79      79      6       13.2    AACCCT
tig00009907     3605814 END     3604655 3604715 61      6       10.2    CCTAAC
tig00009907     3605814 END     3605683 3605814 132     6       22.0    AGGGTT
tig00009908     45279   END     45122   45279   158     6       26.3    AGGGTT
tig00009909     30552   START   1       127     127     6       21.2    AACCCT
tig00009910     3545373 START   1       128     128     6       21.3    TAACCC
tig00009910     3545373 END     3545228 3545373 146     6       24.3    AGGGTT
tig00009911     37049   START   1       137     137     6       23.2    AACCCT
tig00009912     2069247 START   1       131     131     6       21.7    CCTAAC
tig00009912     2069247 END     2069106 2069247 142     6       23.7    TAGGGT
tig00009913     82776   START   1       87      87      6       14.5    CTAACC
tig00009914     1159092 MID     142619  142691  73      6       12.2    GGGTTA
tig00009914     1159092 END     1158992 1159092 101     6       16.8    GGTTAG

Found in both ends : 9
Found in single end: 9
Found in interval  : 1
['tig00000011', 'tig00000041', 'tig00000062', 'tig00000083', 'tig00000095', 'tig00009906', 'tig00009907', 'tig00009910', 'tig00009912']
['tig00000001', 'tig00000019', 'tig00000037', 'tig00000045', 'tig00009908', 'tig00009909', 'tig00009911', 'tig00009913', 'tig00009914']
['tig00009914']
###
tig00000001     6809188 END     6809102 6809188 87      6       14.5    AGGGTT
tig00000011     4963882 START   1       153     153     6       25.5    CTAACC
tig00000011     4963882 END     4963742 4963882 141     6       23.5    TAGGGT
tig00000019     4300852 END     4300725 4300852 128     6       21.3    TAGGGT
tig00000037     3870685 END     3870533 3870685 153     6       25.5    AGGGTT
tig00000041     3858715 START   1       122     122     6       20.3    TAACCC
tig00000041     3858715 END     3858559 3858715 157     6       26.3    TAGGGT
tig00000045     3817878 START   1       141     141     6       23.5    CCTAAC
tig00000062     3821796 START   1       116     116     6       19.5    CTAACC
tig00000062     3821796 END     3821647 3821796 150     6       25.0    AGGGTT
tig00000083     2786661 START   1       145     145     6       24.3    TAACCC
tig00000083     2786661 END     2786526 2786661 136     6       22.7    TAGGGT
tig00000095     2717630 START   1       51      51      6       8.5     CCTAAC
tig00009905     29614   START   1       93      93      6       16.0    CTAACC
tig00009906     5238688 START   1       129     129     6       21.3    TAACCC
tig00009906     5238688 END     5238545 5238688 144     6       23.5    AGGGTT
tig00009907     3605773 START   1       76      76      6       12.7    CCTAAC
tig00009907     3605773 END     3604632 3604692 61      6       10.2    CCTAAC
tig00009907     3605773 END     3605661 3605773 113     6       18.5    AGGGTT
tig00009908     44941   END     44784   44941   158     6       26.3    AGGGTT
tig00009909     30384   START   1       79      79      6       13.7    CCTAAC
tig00009910     3545216 START   1       128     128     6       21.3    TAACCC
tig00009910     3545216 END     3545078 3545216 139     6       23.2    AGGGTT
tig00009911     37038   START   1       135     135     6       22.7    CCTAAC
tig00009912     2069300 START   1       127     127     6       21.2    CCTAAC
tig00009912     2069300 END     2069162 2069300 139     6       23.2    TAGGGT
tig00009913     82753   START   1       87      87      6       14.5    CTAACC
tig00009914     1159066 MID     142651  142723  73      6       12.2    GGGTTA
tig00009914     1159066 END     1158966 1159066 101     6       16.8    GGTTAG

Found in both ends : 8
Found in single end: 11
Found in interval  : 1
['tig00000011', 'tig00000041', 'tig00000062', 'tig00000083', 'tig00009906', 'tig00009907', 'tig00009910', 'tig00009912']
['tig00000001', 'tig00000019', 'tig00000037', 'tig00000045', 'tig00000095', 'tig00009905', 'tig00009908', 'tig00009909', 'tig00009911', 'tig00009913', 'tig00009914']
['tig00009914']
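
For anyone who wants to re-derive the three summary lists from the table rows, here is a minimal Python sketch. It assumes the columns are contig, contig length, position label (START, END or MID), repeat start, repeat end, span, motif length, repeat count and motif; that reading is inferred from the output above, not from the script that produced it. The function name `classify_contigs` is hypothetical.

```python
from collections import defaultdict


def classify_contigs(report: str):
    """Group contigs by where telomere repeats were reported.

    Assumes nine whitespace-separated columns per data row, with the
    position label (START, END or MID) in the third column.
    """
    labels = {"START", "END", "MID"}
    positions = defaultdict(set)
    for line in report.splitlines():
        fields = line.split()
        # Skip blank lines, '###' separators, summary counts and list lines.
        if len(fields) != 9 or fields[2] not in labels:
            continue
        positions[fields[0]].add(fields[2])

    ends = {"START", "END"}
    both = sorted(c for c, p in positions.items() if ends <= p)
    single = sorted(c for c, p in positions.items() if len(p & ends) == 1)
    interval = sorted(c for c, p in positions.items() if "MID" in p)
    return both, single, interval


# A few rows copied from the last block above.
sample = """\
tig00000095     2717630 START   1       51      51      6       8.5     CCTAAC
tig00009907     3605773 START   1       76      76      6       12.7    CCTAAC
tig00009907     3605773 END     3604632 3604692 61      6       10.2    CCTAAC
tig00009907     3605773 END     3605661 3605773 113     6       18.5    AGGGTT
tig00009914     1159066 MID     142651  142723  73      6       12.2    GGGTTA
tig00009914     1159066 END     1158966 1159066 101     6       16.8    GGTTAG
"""

both, single, interval = classify_contigs(sample)
print("Found in both ends :", len(both), both)
print("Found in single end:", len(single), single)
print("Found in interval  :", len(interval), interval)
```

A contig with a MID hit and one end hit, such as tig00009914, appears in both the single-end and interval lists, which matches the lists printed above.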
johnomics commented 4 years ago

-f 0 doesn't make much difference for me either; with -c, it extends some telomeres but shortens others. With -f 0, though, minimap2 takes over an hour to run, rather than 5 minutes without it.

rvaser commented 4 years ago

I'm not sure what the problem is, then, and I don't know whether using --no-trimming will decrease the overall accuracy by much. :/