jradrion / TEFLoN

TEFLoN uses paired-end illumina sequence data to discover and genotype transposable elements present in your samples.

The genotype folder empty. #6

Closed rmercuri closed 3 years ago

rmercuri commented 4 years ago

Hi @jradrion

I tried your pipeline on one human sample (40x coverage), but for some reason the genotype folder is empty, and all the files in countPos/ are empty too.

Here are my commands:

python teflon/TEFLoN/teflon_prep_annotation.py -wd output/teflon/ -a ann/hg38/TE_teflon.sorted.bed -t ann/hg38/TE_hierarchy.sorted.txt\
 -g /home/genomes/Homo_sapiens/hg38/hg38.fa\
 -p all_te 2> log.teflon

bwa index output/teflon/all_te.prep_MP/all_te.mappingRef.fa

bwa mem -t 36 -Y output/teflon/all_te.prep_MP/all_te.mappingRef.fa input/fastq/JSR_N_R1.fastq input/fastq/JSR_N_R2.fastq > /home/JSR_N.sam

python teflon/TEFLoN/teflon.v0.4.py -d output/teflon/all_te.prep_TF/ -s output/teflon/sample.txt -i JSR_N -eb /home/tools/bin/bwa -es /home/tools/bin/samtools\
 -l1 family -l2 family -q 10 -t 36 -sd 820

**-sd was calculated by teflon.v0.4.py**

python teflon/TEFLoN/teflon_collapse.py -d output/teflon/teflon.prep_TF/ -s output/teflon/sample.txt -es /home/tools/bin/samtools -n1 1 -n2 1 -q 20 -t 10

python teflon/TEFLoN/teflon_count.py  -d output/teflon/teflon.prep_TF/ -s output/teflon/sample.txt -i JSR_N -eb /home/tools/bin/bwa -es /home/tools/bin/samtools\
 -l2 family -q 20 -t 12

python teflon/TEFLoN/teflon_genotype.py  -d output/teflon/teflon.prep_TF/ -s output/teflon/sample.txt -dt diploid

Input files (just the head of each):

hierarchyfile
id family order
chr10100000951100001262AluSg ALU nltr
chr10100001399100001712AluJb ALU nltr
chr10100002786100003067L1M5 LINE1 nltr
chr10100003067100003374AluSc8 ALU nltr
chr10100003374100003422L1M5 LINE1 nltr
chr10100003580100003705AluJb ALU nltr
chr10100003746100003836AluJb ALU nltr
chr10100003949100004255AluJb ALU nltr
chr10100004299100004632L1M5 LINE1 nltr

TE_BED
chr1 11504 11675 chr11150411675L1MC5a 484 -
chr1 26790 27053 chr12679027053AluSp 2070 +
chr1 29901 30198 chr12990130198L1MB3 1323 +
chr1 31435 31733 chr13143531733AluJo 2059 +
chr1 33047 33456 chr13304733456L1MB5 2058 +
chr1 33465 33509 chr13346533509Alu 233 +
chr1 33528 34041 chr13352834041L1PA6 4051 -
chr1 34047 34108 chr13404734108L1P1 456 +
chr1 35366 35499 chr13536635499AluJr 1000 +
chr1 39623 39924 chr13962339924AluSx 2292 +

The uniqid is chr+start+end_subfamily
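As a rough illustration of that ID scheme, the sketch below joins the four parts into one ID and then checks that every ID in column 4 of the BED file also appears in the hierarchy file. The function names `make_uniq_id` and `missing_ids` are hypothetical and are not part of TEFLoN. The separator is assumed to be empty because the rows as displayed in this thread show none, and both files are assumed to be whitespace-delimited with a one-line header in the hierarchy file.

```python
def make_uniq_id(chrom, start, end, subfamily, sep=""):
    """Join the four parts into one ID. sep="" matches the IDs as displayed here (an assumption)."""
    return sep.join([chrom, str(start), str(end), subfamily])


def missing_ids(bed_lines, hierarchy_lines):
    """Return BED IDs (column 4) that have no row in the hierarchy file.

    Assumes the first hierarchy line is a header and the ID is its first column.
    """
    known = {line.split()[0] for line in hierarchy_lines[1:] if line.strip()}
    bed_ids = [line.split()[3] for line in bed_lines if line.strip()]
    return [bed_id for bed_id in bed_ids if bed_id not in known]


# Rows copied from the heads shown above. Only the head of each file is
# available here, so a mismatch between these two rows is expected.
hierarchy = ["id family order", "chr10100000951100001262AluSg ALU nltr"]
bed = ["chr1 11504 11675 chr11150411675L1MC5a 484 -"]

print(make_uniq_id("chr1", 11504, 11675, "L1MC5a"))
print(missing_ids(bed, hierarchy))
```

Running a check like this over the complete files would show whether any annotated TE lacks a hierarchy entry.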

Do you have any idea how to solve this problem? Thanks, Rafael

jradrion commented 4 years ago

Hi @rmercuri

Could you provide me with the error/output messages? I cannot currently tell what part of the pipeline might have failed or why.

Also, can you confirm that you were able to successfully run TEFLoN on the example files provided in this repo?

rmercuri commented 4 years ago

Messages from teflon_prep_annotation.py:

Reading TE annotation: ann/hg38/TE_teflon.final.sorted.bed
Reading reference: /home/genomes/Homo_sapiens/hg38/hg38.fa
chr10 133797422 finished
chr11 135086622 finished
chr12 133275309 finished
chr13 114364328 finished
chr14 107043718 finished
chr15 101991189 finished
chr16 90338345 finished
chr17 83257441 finished
chr18 80373285 finished
chr19 58617616 finished
chr1 248956422 finished
chr20 64444167 finished
chr21 46709983 finished
chr22 50818468 finished
chr2 242193529 finished
chr3 198295559 finished
chr4 190214555 finished
chr5 181538259 finished
chr6 170805979 finished
chr7 159345973 finished
chr8 145138636 finished
chr9 138394717 finished
chrM 16569 finished
chrX 156040895 finished
chrY 57227415 finished
Writing annotated TEs as fasta: /home/scratch30/rmercuri_14_jul/output/teflon/all_te.prep_MP/all_te.annotatedTE.fa
Writing genome size file: /home/scratch30/rmercuri_14_jul/output/teflon/all_te.prep_TF/all_te.genomeSize.txt
Writing pseudospace to refspace conversion map: /home/scratch30/rmercuri_14_jul/output/teflon/all_te.prep_TF/all_te.pseudo2ref.pickle.gz
Writing refspace to pseudospace conversion map: /home/scratch30/rmercuri_14_jul/output/teflon/all_te.prep_TF/all_te.ref2pseudo.pickle.gz
Writing reference in pseudospace: /home/scratch30/rmercuri_14_jul/output/teflon/all_te.prep_MP/all_te.pseudo.fa
Dumping pickles...
/home/scratch30/rmercuri_14_jul/output/teflon/all_te.prep_TF/all_te.pseudo2ref.pickle dumped!
/home/scratch30/rmercuri_14_jul/output/teflon/all_te.prep_TF/all_te.ref2pseudo2.pickle dumped!
Converting ann/hg38/TE_teflon.final.sorted.bed to pseudospace ...
Concatonating reference and TE sequences
Reference prep complete.
Map reads to mapping reference: /home/scratch30/rmercuri_14_jul/output/teflon/all_te.prep_MP/all_te.mappingRef.fa

teflon.v0.4.py


Calculating alignment statistics
cmd: /home/tools/bin/samtools stats -t /home/scratch30/rmercuri_14_jul/output/teflon/all_te.prep_TF/all_te.genomeSize.txt /home/scratch60/rmercuri.23.jun/JSR_N.sorted.$
eflon.bam
cmd: /home/tools/bin/samtools depth -Q 10 /home/scratch60/rmercuri.23.jun/JSR_N.sorted.teflon.bam | awk '{sum+=$3; sumsq+=$3*$3} END {print "Average = ",sum/NR; print $
Stdev = ",sqrt(sumsq/NR - (sum/NR)**2)}' > /home/scratch60/rmercuri.23.jun/JSR_N.sorted.teflon.cov.txt
/home/tools/bin/samtools: /home/tools/lib/libcrypto.so.1.0.0: no version information available (required by /home/tools/bin/samtools)
/home/tools/bin/samtools: /lib64/libldap_r-2.4.so.2: no version information available (required by /home/tools/lib/libcurl-gnutls.so.4)
/home/tools/bin/samtools: /lib64/liblber-2.4.so.2: no version information available (required by /home/tools/lib/libcurl-gnutls.so.4)
Groups to search: ['ALU', 'HERVK', 'LINE1', 'SVA']

writing TE bed files...
writing TE bed files completed!
reducing search space...
cmd: /home/tools/bin/samtools view -@ 36 -L /home/scratch30/rmercuri_14_jul/JSR_N.bed_files/mega_complete.bed /home/scratch60/rmercuri.23.jun/JSR_N.sorted.teflon.bam -$
search space succesfully reduced...
new reduced bam file: /home/scratch30/rmercuri_14_jul/JSR_N.sam_files/mega_complete.bam
clustering TE positions...

clustering TE positions completed!
final reduction of search space...
cmd: /home/tools/bin/samtools view -@ 36 -q 10 -L /home/scratch30/rmercuri_14_jul/JSR_N.bed_files/mega_clustered.bed /home/scratch60/rmercuri.23.jun/JSR_N.sorted.teflon
.bam -b
search space succesfully reduced...
new reduced bam file: /home/scratch30/rmercuri_14_jul/JSR_N.sam_files/mega_clustered.bam
estimating TE breakpoints...

estimating TE breakpoints completed!
Sorting positions...
TEFLON DISCOVERY FINISHED!

teflon_collapse.py


Subsampling to the minimum read depth from a single sample: 28.0
Creating /home/scratch60/rmercuri.23.jun/JSR_N.sorted.teflon.subsmpl.bam
cmd: cp /home/scratch60/rmercuri.23.jun/JSR_N.sorted.teflon.bam /home/scratch60/rmercuri.23.jun/JSR_N.sorted.teflon.subsmpl.bam
cmd: /home/tools/bin/samtools index /home/scratch60/rmercuri.23.jun/JSR_N.sorted.teflon.subsmpl.bam
/home/tools/bin/samtools: /home/tools/lib/libcrypto.so.1.0.0: no version information available (required by /home/tools/bin/samtools)
/home/tools/bin/samtools: /lib64/libldap_r-2.4.so.2: no version information available (required by /home/tools/lib/libcurl-gnutls.so.4)
/home/tools/bin/samtools: /lib64/liblber-2.4.so.2: no version information available (required by /home/tools/lib/libcurl-gnutls.so.4)
Calculating alignment statistics for /home/scratch60/rmercuri.23.jun/JSR_N.sorted.teflon.subsmpl.bam
cmd: /home/tools/bin/samtools stats -t /home/scratch30/rmercuri_14_jul/output/teflon/teflon.prep_TF/teflon.genomeSize.txt /home/scratch60/rmercuri.23.jun/JSR_N.sorted.t
eflon.subsmpl.bam
cmd: /home/tools/bin/samtools depth -Q 20 /home/scratch60/rmercuri.23.jun/JSR_N.sorted.teflon.subsmpl.bam | awk '{sum+=$3; sumsq+=$3*$3} END {print "Average = ",sum/NR;
 print "Stdev = ",sqrt(sumsq/NR - (sum/NR)**2)}' > /home/scratch60/rmercuri.23.jun/JSR_N.sorted.teflon.subsmpl.cov.txt
/home/tools/bin/samtools: /home/tools/lib/libcrypto.so.1.0.0: no version information available (required by /home/tools/bin/samtools)
/home/tools/bin/samtools: /lib64/libldap_r-2.4.so.2: no version information available (required by /home/tools/lib/libcurl-gnutls.so.4)
/home/tools/bin/samtools: /lib64/liblber-2.4.so.2: no version information available (required by /home/tools/lib/libcurl-gnutls.so.4)
finished standardaizing sample depth
collapse: JSR_N

finished collapsing samples
Sorting positions
Collapse union of all samples...

TEFLON COLLAPSE FINISHED!

teflon_count.py


creating directory: /home/scratch30/rmercuri_14_jul/JSR_N.tmp
reducing search space...

 cmd: /home/tools/bin/samtools view -@ 12 -q 20 -L /home/scratch30/rmercuri_14_jul/JSR_N.tmp/megaBed.bed /home/scratch60/rmercuri.23.jun/JSR_N.sorted.teflon.subsmpl.bam 
 -b
 search space succesfully reduced...
 new reduced bam file: /home/scratch30/rmercuri_14_jul/JSR_N.tmp/megaBed.bam
 cmd: /home/tools/bin/bwa index /home/scratch30/rmercuri_14_jul/JSR_N.tmp/megaBed.bam
 [bwa_index] Pack FASTA... 0.00 sec
 [bwa_index] Construct BWT for the packed sequence...
 [bwa_index] 0.00 seconds elapse.
 [bwa_index] Update BWT... 0.00 sec
 [bwa_index] Pack forward-only FASTA... 0.00 sec
 [bwa_index] Construct SA from BWT and Occ... 0.00 sec
 [main] Version: 0.7.9a-r786
 [main] CMD: /home/tools/bin/bwa index /home/scratch30/rmercuri_14_jul/JSR_N.tmp/megaBed.bam
 [main] Real time: 0.076 sec; CPU: 0.014 sec
 counting reads...

 TEFLON COUNT FINISHED!

teflon_genotype.py


Lower-bound coverage threshold filters corresponding to samples ['JSR_N'] is [1]
NOTE: all sites with adjusted read counts > upper-bound coverage threshold will be marked -9
Upper-bound coverage threshold filters corresponding to samples ['JSR_N'] is [109]
NOTE: all sites with adjusted read counts > upper-bound coverage threshold will be marked -9
cdm: gunzip -c /home/scratch30/rmercuri_14_jul/output/teflon/teflon.prep_TF/teflon.pseudo2ref.pickle.gz > /home/scratch30/rmercuri_14_jul/output/teflon/teflon.prep_TF/$
eflon.pseudo2ref.pickle.gz.tmp
loading pickle: /home/scratch30/rmercuri_14_jul/output/teflon/teflon.prep_TF/teflon.pseudo2ref.pickle.gz.tmp
NOTE: this step can be time and memory intensive for large reference genomes
pickle loaded!
coming soon...use "pooled" to obtain presence and absence read counts
TEFLON GENOTYPE FINISHED!

> Also, can you confirm that you were able to successfully run TEFLoN on the example files provided in this repo?

Yes, the example works fine.

jradrion commented 4 years ago

Hmm, OK, I'm not exactly sure what the problem is. One thing I noticed is that you pass a different directory to -d when you run teflon.v0.4.py (output/teflon/all_te.prep_TF/) than when you run teflon_collapse.py, teflon_count.py, and teflon_genotype.py (output/teflon/teflon.prep_TF/). Also, did you sort and index your .sam file? I'm not seeing that in your original list of commands.
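One way to catch a -d mismatch like this before running anything is to pull the -d value out of each command and compare them. The following is a minimal sketch, not part of TEFLoN: `working_dirs` is a hypothetical helper, and the command strings are abbreviated copies of the ones posted above.

```python
import shlex


def working_dirs(commands, flag="-d"):
    """Map each script's file name to the value that follows `flag` in its command."""
    found = {}
    for cmd in commands:
        tokens = shlex.split(cmd)
        if flag not in tokens:
            continue
        pos = tokens.index(flag)
        if pos + 1 >= len(tokens):
            continue  # flag given without a value
        script = next((t for t in tokens if t.endswith(".py")), tokens[0])
        found[script.rsplit("/", 1)[-1]] = tokens[pos + 1]
    return found


# Abbreviated from the commands in this thread (other options omitted).
commands = [
    "python teflon/TEFLoN/teflon.v0.4.py -d output/teflon/all_te.prep_TF/ -s output/teflon/sample.txt",
    "python teflon/TEFLoN/teflon_collapse.py -d output/teflon/teflon.prep_TF/ -s output/teflon/sample.txt",
    "python teflon/TEFLoN/teflon_count.py -d output/teflon/teflon.prep_TF/ -s output/teflon/sample.txt",
    "python teflon/TEFLoN/teflon_genotype.py -d output/teflon/teflon.prep_TF/ -s output/teflon/sample.txt",
]

dirs = working_dirs(commands)
if len(set(dirs.values())) > 1:
    print("Inconsistent -d values:")
    for script, path in dirs.items():
        print(f"  {script}: {path}")
```

A check of this kind only compares the strings as typed, so it would not notice two different spellings of the same directory (for example, with and without a trailing slash).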

Can you try modifying the template provided here https://github.com/jradrion/TEFLoN/blob/master/test_files/sample_pipeline.sh with your data and running the commands that way, just so I can make sure we're on the same page?
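On the sort-and-index question, the step between bwa mem and teflon.v0.4.py could look roughly like the sketch below. It only builds the two samtools command lines and prints them; nothing is executed. The helper name `sort_and_index_cmds` and the output file name are assumptions made for illustration, and the `samtools sort -o` form assumes samtools 1.3 or later.

```python
def sort_and_index_cmds(sam_path, samtools="samtools", threads=1):
    """Build (but do not run) the commands to coordinate-sort a SAM file and index the result."""
    # Hypothetical naming: replace the final extension with ".sorted.bam".
    bam_path = sam_path.rsplit(".", 1)[0] + ".sorted.bam"
    sort_cmd = [samtools, "sort", "-@", str(threads), "-o", bam_path, sam_path]
    index_cmd = [samtools, "index", bam_path]
    return sort_cmd, index_cmd


# Paths taken from the commands earlier in this thread.
for cmd in sort_and_index_cmds(
    "/home/JSR_N.sam", samtools="/home/tools/bin/samtools", threads=4
):
    print(" ".join(cmd))
```

The printed commands can be pasted into a shell, or each list can be passed to `subprocess.run(cmd, check=True)`; the resulting BAM path is then what the sample file given to -s would need to point to.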

jradrion commented 3 years ago

I'm going to go ahead and close this issue since there haven't been any updates in about 6 months. Feel free to reopen it if necessary.