rajewsky-lab / find_circ2

Find circRNAs in RNA-seq data.
GNU General Public License v3.0
13 stars 6 forks source link

No any sequences after the find_circ2 #10

Closed Runningchuan closed 5 years ago

Runningchuan commented 5 years ago

I have finsh my circRNA work with CIRI_v2.0.6, and get nearly 20000 circRNAs from each sample. Then i start my work with find_circ2,but i got no results at the last step. I don't the reason.

This is the code and results i have got.

1.

bowtie2 \
    -p 8 \
    --very-sensitive \
    --score-min=C,-15,0 \
    --mm \
    -x /media/lyc/data/refence/mouse/mm10_bowtie2_index/mm10 \
    -q \
    -1 con1_RMrRNA_1.fq.gz \
    -2 con1_RMrRNA_2.fq.gz \
2> bowtie2.log \
| samtools view \
      -hbuS 
> con1.bam && \
samtools sort \
    con1.bam \
> con1.sort.bam

This is the bowtie2 log:

60095995 reads; of these:
60095995 (100.00%) were paired; of these:
17254530 (28.71%) aligned concordantly 0 times
35318635 (58.77%) aligned concordantly exactly 1 time
7522830 (12.52%) aligned concordantly >1 times
----
17254530 pairs aligned concordantly 0 times; of these:
1420781 (8.23%) aligned discordantly 1 time
----
15833749 pairs aligned 0 times concordantly or discordantly; of these:
31667498 mates make up the pairs; of these:
22405301 (70.75%) aligned 0 times
8442762 (26.66%) aligned exactly 1 time
819435 (2.59%) aligned >1 times
81.36% overall alignment rate

2.

samtools view \
    -hf 4 \
    con1.sort.bam \
| samtools view \
      -Sb \
> con1_unmap_sort.bam && \
python unmapped2anchors.py \
    con1_unmap_sort.bam \
| gzip \
> con1_anchors.fastq.gz

3.

bowtie2 \
    -p 10 \
    --score-min=C,-15,0 \
    --reorder \
    --mm \
    -q \
    -U con1_anchors.fastq.gz \
    -x /media/lyc/data/refence/mouse/mm10_bowtie2_index/mm10 \
| ./find_circ.py \
       -G /media/lyc/data/refence/mouse/mm10.fa

This is the bowtie2 log:

44810602 reads; of these:
44810602 (100.00%) were unpaired; of these:
7788756 (17.38%) aligned 0 times
19133030 (42.70%) aligned exactly 1 time
17888816 (39.92%) aligned >1 times
82.62% overall alignment rate
processed 36.51M (paired or single end) reads in 7.7 minutes (overall 79.30k reads/second on average)
results stored in 'find_circ_run'

And this is the run.log in the find_circ_run folder:

2019-08-05 08:47:14,150 INFO find_circ find_circ 1.99 invoked as './find_circ.py -G /media/lyc/data/refence/mouse/mm10.fa'
2019-08-05 08:47:17,981 INFO find_circ reading from stdin
2019-08-05 08:51:29,266 INFO GenomeAccessor # mmap: Loading genomic sequence for chromosome chr9 from '/media/lyc/data/refence/mouse/mm10.fa'
2019-08-05 08:51:29,266 INFO indexed_fasta # indexed_fasta.load_index('/media/lyc/data/refence/mouse/mm10.fa.byo_index')
2019-08-05 08:54:58,679 INFO find_circ processed 36.51M (paired or single end) reads in 7.7 minutes (overall 79.30k reads/second on average)
2019-08-05 08:54:58,680 INFO find_circ run finished
2019-08-05 08:54:58,680 INFO find_circ circ_junc_not_unique=4104.0
2019-08-05 08:54:58,716 INFO find_circ circ_no_bp=35.0
2019-08-05 08:54:58,716 INFO find_circ lin_junc_not_unique=3912.0
2019-08-05 08:54:58,716 INFO find_circ lin_no_bp=7.0
2019-08-05 08:54:58,716 INFO find_circ total_mates=36510597.0
2019-08-05 08:54:58,716 INFO find_circ unmapped_reads=7788756.0
2019-08-05 08:54:58,716 INFO find_circ unspliced_mates=36502539.0

the other files such as circ or lin splice_sites.bed and spliced_reads.fastq.gz have no sequences, only have the header.

i don't which step was wrong.

Thanks!


edit (@mschilli87): style

mschilli87 commented 5 years ago

Dear @Runningchuan,

Please note that find_circ2 is still unpublished and cannnot be considered ready-to-use as a black-box tool. Most of the documentations still refers to find_circ1. This will hopefully change soon but is still the current state. Note that the main difference between find_circ1 and find_circ2 is that the latter is not meant to be run on bowtie2-mapped anchors but (unsorted!) bwa-mem-mapped reads.

Hope this helps to get you started.

Best, Marcel

Runningchuan commented 5 years ago

Hi @mschilli87 Thank you for your reply help me to solve this problem. I try to use find_circ1 to processing my data again,but at the step of find_circ.py have an error:

This is the code and result:

1. bowtie2 \ -p 8 \ --very-sensitive \ --phred64 \ --mm \ --score-min=C,-15,0 \ -x /media/lyc/data/refence/mouse/mm10_bowtie2_index/mm10 \ -q -1 /media/lyc/work/circRNA/CIRI2/con1_1.P.trim.gz \ -2 /media/lyc/work/circRNA/CIRI2/con1_2.P.trim.gz \ 2>con1_bowtie2_mapp.result.txt\ | samtools view -hbuS \ | samtools sort \ | samtools view -hf 4\ | samtools view -Sb \

con1_unmap_sort.bam

This is the bowtie2 log:

        55092837 reads; of these:
          55092837 (100.00%) were paired; of these:
            14453309 (26.23%) aligned concordantly 0 times
            33928030 (61.58%) aligned concordantly exactly 1 time
            6711498 (12.18%) aligned concordantly >1 times
            ----
            14453309 pairs aligned concordantly 0 times; of these:
              1397170 (9.67%) aligned discordantly 1 time
            ----
            13056139 pairs aligned 0 times concordantly or discordantly; of these:
              26112278 mates make up the pairs; of these:
                18712081 (71.66%) aligned 0 times
                6859099 (26.27%) aligned exactly 1 time
                541098 (2.07%) aligned >1 times
        83.02% overall alignment rate

2. unmapped2anchors.py script

  python unmapped2anchors.py \
                 con1_unmap_sort.bam  \
           |      gzip \
          >con1_anchors.fastq.gz

3. Then bowtie2 the con1_anchors.fastq.gz

       bowtie2 \
                 -p 8 \
                 --score-min=C,-15,0 \
                --reorder \
                --mm \
                -q \
                -U con1_anchors.fastq.gz \
                -x /media/lyc/data/refence/mouse/mm10_bowtie2_index/mm10 \
          2>second_bowtie.log \
            >con1_anchonor_map.bam

This is the second_bowtie2.log:

        37424162 reads; of these:
          37424162 (100.00%) were unpaired; of these:
            6360193 (16.99%) aligned 0 times
            16625264 (44.42%) aligned exactly 1 time
            14438705 (38.58%) aligned >1 times
        83.01% overall alignment rate

4. At last , I use the find_circ.py , but it got an error :

     cat con1_anchonor_map.bam \
            |     ./find_circ.py \
                -G /media/lyc/data/refence/mouse/mm10.fa \
                --name=con1

The error is below:

ERROR:root:Unhandled exception raised while processing alignment pair 458549, on input starting at SAM line 917098, FASTQ line 3668392 Traceback (most recent call last): File "./find_circ.py", line 651, in bp = find_breakpoints(A,B,read,chrom) File "./find_circ.py", line 545, in find_breakpoints A_flank = genome.get(chrom,A.aend-margin,A.aend-margin + flank,'+').upper() File "./find_circ.py", line 292, in get return acc.get_data(chrom,start,end,sense,kwargs) File "./find_circ.py", line 343, in get_data seq = self.data.get_data(chrom,start,end,"+") File "./find_circ.py", line 167, in get_data ofs,ldata,skip,skip_char,size = self.chrom_stats[chrom] **KeyError: 'chrUn_GL456378'

Then I check the SAM file line 917098 with upper 8 line ,and FASTQ file line 3668392 upper 20lines ,but I found no difference. I don't kown the reason.

This is the SAM file

line:917091### | K00137:313:HCCK2BBXX:5:2222:30553:8822_A__TGAAAGAAAGAAAGAAAGAAAGAAAGAGAAAAGAAGGGAAAAGAAAAGAAAAGAACAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAGAAGAGAAGAG | 16 | chr3 | 89329838 | 1 | 20M | * | 0 | 0TTCTTTCTTTCTTTCTTTCA | JJJJJJJJJJJJJJJFFFAA | AS:i:0 | XS:i:0 | XN:i:0 | XM:i:0 | XO:i:0 | XG:i:0 | NM:i:0 | MD:Z:20 | YT:Z:UU |  

-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --

line:917092### | K00137:313:HCCK2BBXX:5:2222:30553:8822_B | 16 | chr2 | 60475292 | 1 | 20M | * | 0 | 0 | CTCTTCTCTTCTCTTTTCTT | JJJJJJJJJJJJJJJJJJJJ | AS:i:0 | XS:i:0 | XN:i:0 | XM:i:0 | XO:i:0 | XG:i:0 | NM:i:0 | MD:Z:20 | YT:Z:UU

line:917093### | K00137:313:HCCK2BBXX:5:2205:15595:4426_A__CTTGGAGCTCGCGCCTCCTGCGGCGTCGCGGTCCCTGGTGAGCCGTAGAGGGCCAGTGCCATGATCCGACAAGAACTCTCCACGTCCTACCAGGAGCTCA | 0 | chr1 | 191325981 | 42 | 20M | * | 0 | 0CTTGGAGCTCGCGCCTCCTG | AAFFFFJJJJJJJJJJJJJJ | AS:i:0 | XN:i:0 | XM:i:0 | XO:i:0 | XG:i:0 | NM:i:0 | MD:Z:20 | YT:Z:UU |   |  

line:917094### | K00137:313:HCCK2BBXX:5:2205:15595:4426_B | 4 | | 0 | 0 | | * | 0 | 0 | CACGTCCTACCAGGAGCTCA | JJJJJJJJFJJJJJAJJJJJ | YT:Z:UU |   |   |   |   |   |   |   |  

line:917095### | K00137:313:HCCK2BBXX:5:2205:15574:4532_A__CTTGGAGCTCGCGCCTCCTGCGGCGTCGCGGTCCCTGGTGAGCCGTAGAGGGCCAGTGCCATGATCCGACAAGAACTCTCCACGTCCTACCAGGAGCTCA | 0 | chr1 | 191325981 | 42 | 20M | * | 0 | 0CTTGGAGCTCGCGCCTCCTG | AAAFFJJJJJJJJJJFJJJJ | AS:i:0 | XN:i:0 | XM:i:0 | XO:i:0 | XG:i:0 | NM:i:0 | MD:Z:20 | YT:Z:UU |   |  

line:917096### | K00137:313:HCCK2BBXX:5:2205:15574:4532_B | 4 | | 0 | 0 | | * | 0 | 0 | CACGTCCTACCAGGAGCTCA | FFJJFFJJFJJJJJAJJAFF | YT:Z:UU |   |   |   |   |   |   |   |  

line:917097### | K00137:313:HCCK2BBXX:5:2205:16112:6765_A__CTTGGAGCTCGCGCCTCCTGCGGCGTCGCGGTCCCTGGTGAGCCGTAGAGGGCCAGTGCCATGATCCGACAAGAACTCTCCACGTCCTACCAGGAGCTCA | 0 | chr1 | 191325981 | 42 | 20M | * | 0 | 0CTTGGAGCTCGCGCCTCCTG | AAFFFJJJJJJJJJJJJJJJ | AS:i:0 | XN:i:0 | XM:i:0 | XO:i:0 | XG:i:0 | NM:i:0 | MD:Z:20 | YT:Z:UU |   |  

line:917098### | K00137:313:HCCK2BBXX:5:2205:16112:6765_B | 4 | | 0 | 0 | | * | 0 | 0 | CACGTCCTACCAGGAGCTCA | JJJJJJJJJJJJJJAJJFJJ | YT:Z:UU |   |   |   |   |   |   |   |  

This is the FASTQ file

line:3668373### | @K00137:313:HCCK2BBXX:4:2211:27417:2299_B

-- | --

line:3668374### | CCCCTCCCCTCCCCTCTCTT

line:3668375### | +

line:3668376### | AJJJAAJFA-FFJJ<A7AAA

line:3668377### | @K00137:313:HCCK2BBXX:5:2112:19431:15399_A__AGCTGTTGAGTTCCGTCAAGAGACCAGCGTGGTTAACTATATTGACCAGAGACCCGCAGCCGAGAGGAGTGCTCAGTTGTTCTTTGTGGTCTTCGAATGG

line:3668378### | AGCTGTTGAGTTCCGTCAAG

line:3668379### | +

line:3668380### | AAFFFJJJJJJJJJJJJJJJ

line:3668381### | @K00137:313:HCCK2BBXX:5:2112:19431:15399_B

line:3668382### | TCTTTGTGGTCTTCGAATGG

line:3668383### | +

line:3668384### | JJJFJJFJJJJJJJJJJJJJ

line:3668385### | @K00137:309:HC2TFBBXX:6:1115:14154:7732_A__CTTTGTAGGAACTAGTAAGGTCGTCAAAAAGCTTGCCGTTCATCTCCATGAGGGTTTTCAGCACATTGTACACCAGTGCTACAATAAATGCTGCTCAATC

line:3668386### | CTTTGTAGGAACTAGTAAGG

line:3668387### | +

line:3668388### | AAFFFJJJJJJJJJJJJJJJ

line:3668389### | @K00137:309:HC2TFBBXX:6:1115:14154:7732_B

line:3668390### | ACAATAAATGCTGCTCAATC

line:3668391### | +

line:3668392### | JJJJJJJJJJJJJJJJJJJJ

Could you help me help me to solve this problem?

Thanks!

mschilli87 commented 5 years ago

@Runningchuan: I'm sorry but find_circ1 is not used anymore since many years and I really cannot offer any support for it anymore. We are aware that teh state of find_circ2 is anything but ideal but given the EOL of Python 2.x there is no point in updating the documentation before we finde the time to port it to Python 3.x.

Until then, I highly recommend using find_circ2 as-is in combination with bwa-mem. I could give you further assistence if you get stuck anywhere down that road.