Closed jfear closed 7 years ago
This is related to #6, where I decided to keep both. However, now I am more motivated to remove bam_stat because of the rseqc dependencies.
Here is the samtools stat and bam stat output:
'bam_stat': {
'non_unique': 292001,
'nonsplice_reads': 3841210,
'optical_pcr_duplicates': 0,
'proper_pair_map_to_different_chrom': 0,
'qc_failed': 0,
'read_1': 2098571,
'read_2': 1773091,
'reads_map_minus': 1935259,
'reads_map_plus': 1936403,
'reads_mapped_proper_pairs': 3238278,
'splice_reads': 30452,
'total_records': 35575122,
'unique': 3871662,
'unmapped_reads': 30692893
},
'samtools_stats': {
'1st_fragments': 17428278.0,
'average_length': 143.0,
'average_quality': 33.6,
'bases_duplicated': 0.0,
'bases_mapped': 585920955.0,
'bases_mapped_(cigar)': 582670594.0,
'bases_trimmed': 0.0,
'filtered_sequences': 0.0,
'insert_size_average': 379.6,
'insert_size_standard_deviation': 891.1,
'inward_oriented_pairs': 1339629.0,
'is_sorted': 1.0,
'last_fragments': 17428278.0,
'maximum_length': 151.0,
'mismatches': 3540532.0,
'non-primary_alignments': 718566.0,
'outward_oriented_pairs': 10201.0,
'pairs_on_different_chromosomes': 18735.0,
'pairs_with_other_orientation': 1418.0,
'raw_total_sequences': 34856556.0,
'reads_MQ0': 21224.0,
'reads_QC_failed': 0.0,
'reads_duplicated': 0.0,
'reads_mapped': 4163663.0,
'reads_mapped_and_paired': 3555266.0,
'reads_paired': 34856556.0,
'reads_properly_paired': 3478458.0,
'reads_unmapped': 30692893.0,
'sequences': 34856556.0,
'total_length': 4996041946.0
},
I also have similar picard and hisat2 output:
'picard_collectrnaseqmetrics': {
'first': {
'CODING_BASES': 183959307.0,
'CORRECT_STRAND_READS': 427095.0,
'IGNORED_READS': 0.0,
'INCORRECT_STRAND_READS': 388979.0,
'INTERGENIC_BASES': 166530170.0,
'INTRONIC_BASES': 190018903.0,
'LIBRARY': nan,
'MEDIAN_3PRIME_BIAS': 0.0,
'MEDIAN_5PRIME_BIAS': 0.001894,
'MEDIAN_5PRIME_TO_3PRIME_BIAS': 0.0,
'MEDIAN_CV_COVERAGE': 2.418909,
'NUM_R1_TRANSCRIPT_STRAND_READS': 145031.0,
'NUM_R2_TRANSCRIPT_STRAND_READS': 125440.0,
'NUM_UNEXPLAINED_READS': 172334.0,
'PCT_CODING_BASES': 0.31579,
'PCT_CORRECT_STRAND_READS': 0.523353,
'PCT_INTERGENIC_BASES': 0.28586999999999996,
'PCT_INTRONIC_BASES': 0.326192,
'PCT_MRNA_BASES': 0.387938,
'PCT_R1_TRANSCRIPT_STRAND_READS': 0.536216,
'PCT_R2_TRANSCRIPT_STRAND_READS': 0.46378400000000003,
'PCT_RIBOSOMAL_BASES': nan,
'PCT_USABLE_BASES': 0.045233999999999996,
'PCT_UTR_BASES': 0.072149,
'PF_ALIGNED_BASES': 582537692.0,
'PF_BASES': 4996041946.0,
'READ_GROUP': nan,
'RIBOSOMAL_BASES': nan,
'SAMPLE': nan,
'UTR_BASES': 42029312.0
},
'hisat2': {
'num_concordant_multimappers': 394703.0,
'num_concordant_reads_unaligned': 15689049.0,
'num_concordant_reads_uniquely_aligned': 1344526.0,
'num_discordant_reads_aligned': 18408.0,
'num_multimappers': 151066.0,
'num_reads': 17428278.0,
'num_reads_paired': 17428278.0,
'num_unaligned': 30692893.0,
'num_uniquely_algined': 497323.0,
'per_alignment': 11.95
},
Looking at these results, I don't think there is any added value in having bam_stat. I am actually a little confused by its output, because only the unaligned counts match with other tools. The only value is would add is the number of spliced reads, which I don't think is needed for this project.
Need to decide if bam stat is really useful. If I remove bam stat then I can remove rseqc as a dependency.