jts / sga

de novo sequence assembler using string graphs
http://genome.cshlp.org/content/22/3/549

sga depends on some abyss programs, no errors to indicate this #24

Closed gringer closed 12 years ago

gringer commented 12 years ago

It might be worth noting somewhere that sga (or more specifically sga-bam2de.pl) depends on two programs from ABySS: abyss-fixmate and DistanceEst. On my Debian system I was able to install abyss, but needed to modify sga-bam2de.pl to point at '/usr/lib/abyss/abyss-fixmate' and '/usr/lib/abyss/DistanceEst' for the functions to work (these programs are not in the default path). No errors were produced to indicate that these programs were missing, so it took a while for me to work out why my scaffolding wasn't joining any contigs.
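An alternative to editing the script is to put the ABySS helper directory on the PATH. The following is only a sketch: it assumes the Debian location quoted above and assumes that sga-bam2de.pl looks the programs up by name; `prepend_path` is a hypothetical helper, not part of sga or ABySS.

```shell
# Prepend a directory to PATH unless it is already listed there.
# prepend_path is a hypothetical helper name used for illustration.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;               # already present, nothing to do
        *) PATH="$1:$PATH" ;;
    esac
    export PATH
}

# Debian package location reported in this issue; other systems may differ.
prepend_path /usr/lib/abyss
```

Running this in the shell that later invokes sga-bam2de.pl should make abyss-fixmate and DistanceEst resolvable without modifying the script.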

FWIW, it might be possible to remove the dependence on abyss-fixmate without too much additional work. Generating the histogram of read-pair distances (e.g. pe.hist) can be done fairly fast with a combination of samtools and awk, using the 'view' command to filter on the first read of a pair when the pairs are properly mapped (bowtie2 seems to define this as correct orientation with not too much distance between reads):

samtools view -f 0x42 mappedreads.bam | awk '{print sqrt($9*$9)}' | sort -n | uniq -c | sort -k 2,2n | awk '{print $2"\t"$1}' > pe.hist
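The counting step can also be done in a single awk pass, which makes it easy to try on a few records without a BAM file. This is a sketch rather than a drop-in replacement: `insert_size_hist` is a hypothetical function name, and it assumes tab-separated SAM records on standard input with the template length in field 9.

```shell
# Build a "distance<TAB>count" histogram from SAM records on stdin.
# insert_size_hist is a hypothetical name used for illustration.
insert_size_hist() {
    awk -F '\t' '
        { d = ($9 < 0) ? -$9 : $9; count[d]++ }          # absolute template length
        END { for (d in count) print d "\t" count[d] }
    ' | sort -n                                            # order by distance
}
```

With real data the input would come from the same filter as above, for example `samtools view -f 0x42 mappedreads.bam | insert_size_hist > pe.hist`.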

Creating the contig distance file (e.g. pe.de) will probably require more than simple command line pipes. I can generate a sorted BAM file containing only pairs whose reads map to different contigs:

(samtools view -H mappedreads.bam; samtools view -F 0x02 mappedreads.bam | cut -f 1-11 | awk -v 'OFS=\t' '{if($7 != "="){$1="*";$10="*";$11="*";print $0}}') | samtools view -Sb - | samtools sort - pe.diffcontigs.sorted

However, altering the template length (field 9) would, I expect, require knowing the most likely read-pair distance.

jts commented 12 years ago

Thanks for reporting this. In 79c6c969ffb I have added a dependency check for these two abyss programs. I am going to leave abyss-fixmate in the pipeline since I depend on DistanceEst already. Setting the template length is critical for DistanceEst.
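For anyone wrapping the pipeline in their own scripts, a dependency check of this kind can be sketched in shell as below. This is not the code from the commit mentioned above; `missing_programs` and `require_programs` are hypothetical names, and the check only looks for the programs on the PATH.

```shell
# Print the name of each listed program that cannot be found on PATH.
missing_programs() {
    for prog in "$@"; do
        command -v "$prog" >/dev/null 2>&1 || printf '%s\n' "$prog"
    done
}

# Return non-zero, with a message on stderr, if any listed program is missing.
require_programs() {
    missing=$(missing_programs "$@")
    if [ -n "$missing" ]; then
        echo "Error: required programs not found on PATH:" >&2
        echo "$missing" >&2
        return 1
    fi
}

# Example use before scaffolding:
#   require_programs abyss-fixmate DistanceEst || exit 1
```

Failing early like this avoids the situation described in the report, where missing helpers produced no error and scaffolding silently joined no contigs.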