jradrion / TEFLoN

TEFLoN uses paired-end illumina sequence data to discover and genotype transposable elements present in your samples.
13 stars 7 forks source link

Index BAM with BWA or Samtools? #4

Closed hzpc-joostk closed 4 years ago

hzpc-joostk commented 4 years ago

Hi @jradrion

We are trying your pipeline for TE discovery in potato (tetraploid). However we face a few issues. Hereunder one.

Should we index an alignment with bwa index or samtools index?

https://github.com/jradrion/TEFLoN/blob/f6a4dbc6f06e956a48883811c3fc132dd64c3595/teflon.v0.4.py#L352-L355

https://github.com/jradrion/TEFLoN/blob/f6a4dbc6f06e956a48883811c3fc132dd64c3595/teflon.v0.4.py#L389-L392

AFAIK it is samtools, but trying it with BWA does not return an error. However, the result does not make sense. :-)

jradrion commented 4 years ago

Hi @hzpc-joostk

Hmm, this is odd. I have not looked at this code in quite some time, but you are correct, I would think that indexing of the alignment should be done with samtools and not BWA. I'm not sure what is going on here. I'm in the middle of a pretty hectic week, but I will dig in to this and try to figure it out as soon as I have the chance. I'll need to take a closer look.

hzpc-joostk commented 4 years ago

Thanks @jradrion I forked your project and try to fix minor issues we face. I'll keep you posted. :-)

jradrion commented 4 years ago

Thank you, @hzpc-joostk. I appreciate you raising the issue. I have not run this code in a really long time (hence python 2). Have you actually tried running the code as-is? When you said "the result does not make sense", I'm wondering if you meant that the result from running Teflon doesn't make sense or simply using BWA to index an alignment doesn't make sense?

I haven't checked, but I almost wonder if the resulting index generated by BWA is ever used downstream, and that's why nothing throws an error?

hzpc-joostk commented 4 years ago

the result does not make sense

What I meant here was that the results of bwa index some_alignment.bam does not make sense. 😉 It might be that the BAM's index is never used indeed..

The pipeline does not complete for whatever reason. So we cannot yet comment on the results. 😉 And to me it seems Python's tracebacks are messed up due to multiprocessing. But I am also looking into that issue.

Anyway, we have setup an environment using Anaconda/Miniconda to get the dependencies in place:

conda create --prefix ./TEFLoN-env --channel bioconda python=2.7 samtools=1.3 bwa=0.7
conda activate ./TEFLoN-env

Cloned your repo's master branch (commit f6a4dbc6f06e956a48883811c3fc132dd64c3595). Then run the tools in following order (trimmed the arguments):

python teflon_prep_annotation.py
bwa index
bwa mem | samtools view | samtools sort
samtools index
python teflon.v0.4.py
python teflon_collapse.py
python teflon_count.py
python teflon_genotype.py
jradrion commented 4 years ago

Hi @hzpc-joostk,

I apologize for that it's taken me so long to respond; I got sidetracked and completely forgot about this!

OK, so those indexes are indeed never used, and to be honest I'm not sure why they were ever created in the first place. I removed those lines.

The pipeline does not complete for whatever reason.

Your conda environment looks fine to me. I just set up a similar environment on my machine and was able to run everything without issue. I made a new file in ./test_files/ called sample_pipeline.sh. This should allow you to test if you can run the pipeline to completion on the test files? Can you try pulling the changes and let me know if you continue to get errors.

hzpc-joostk commented 4 years ago

Hi @jradrion! No worries on the delay. 😉

Thanks! I will pull the latest code, test and give it a go.

hzpc-joostk commented 4 years ago

@jradrion the pipeline seems to run the sample data without any problems! Nice.

.
├── countPos
│   ├── sample1.all_positions_sorted.collapsed.txt
│   ├── sample1.all_positions_sorted.txt
│   ├── sample1.all_positions.txt
│   ├── sample1.counts.txt
│   ├── sample1.pseudoSpace.genotypes.txt
│   ├── union_sorted.collapsed.txt
│   ├── union_sorted.txt
│   └── union.txt
├── genotypes
│   └── sample1.genotypes.txt
└── reference
    ├── TEST.prep_MP
    │   ├── TEST.annotatedTE.fa
    │   ├── TEST.mappingRef.fa
    │   ├── TEST.mappingRef.fa.amb
    │   ├── TEST.mappingRef.fa.ann
    │   ├── TEST.mappingRef.fa.bwt
    │   ├── TEST.mappingRef.fa.pac
    │   ├── TEST.mappingRef.fa.sa
    │   └── TEST.pseudo.fa
    ├── TEST.prep_TF
    │   ├── TEST.genomeSize.txt
    │   ├── TEST.hier
    │   ├── TEST.pseudo2ref.pickle.gz
    │   ├── TEST.ref2pseudo.pickle.gz
    │   └── TEST.te.pseudo.bed
    ├── TEST.sorted.bam
    ├── TEST.sorted.bam.bai
    ├── TEST.sorted.cov.txt
    ├── TEST.sorted.stats.txt
    ├── TEST.sorted.subsmpl.bam
    ├── TEST.sorted.subsmpl.bam.bai
    ├── TEST.sorted.subsmpl.cov.txt
    └── TEST.sorted.subsmpl.stats.txt

5 directories, 30 files

I will proceed testing with our data.

jradrion commented 4 years ago

Good to hear! I'm going to go ahead and close this issue now. Feel free to open a new one if you have any issues at a later point.