jfear / ncbi_remap

This is the drosSRA project, where we are remapping all Drosophila melanogaster RNA-seq data to FlyBase release 6 and updating annotations.
2 stars 1 forks source link

Do I really want bam stat? #27

Closed jfear closed 7 years ago

jfear commented 7 years ago

Need to decide if bam stat is really useful. If I remove bam stat then I can remove rseqc as a dependency.

jfear commented 7 years ago

This is related to #6, where I decided to keep both. However, now I am more motivated to remove bam_stat because of the rseqc dependencies.

Here is the samtools stat and bam stat output:

'bam_stat': {
    'non_unique': 292001,
    'nonsplice_reads': 3841210,
    'optical_pcr_duplicates': 0,
    'proper_pair_map_to_different_chrom': 0,
    'qc_failed': 0,
    'read_1': 2098571,
    'read_2': 1773091,
    'reads_map_minus': 1935259,
    'reads_map_plus': 1936403,
    'reads_mapped_proper_pairs': 3238278,
    'splice_reads': 30452,
    'total_records': 35575122,
    'unique': 3871662,
    'unmapped_reads': 30692893
},
'samtools_stats': {
    '1st_fragments': 17428278.0,
    'average_length': 143.0,
    'average_quality': 33.6,
    'bases_duplicated': 0.0,
    'bases_mapped': 585920955.0,
    'bases_mapped_(cigar)': 582670594.0,
    'bases_trimmed': 0.0,
    'filtered_sequences': 0.0,
    'insert_size_average': 379.6,
    'insert_size_standard_deviation': 891.1,
    'inward_oriented_pairs': 1339629.0,
    'is_sorted': 1.0,
    'last_fragments': 17428278.0,
    'maximum_length': 151.0,
    'mismatches': 3540532.0,
    'non-primary_alignments': 718566.0,
    'outward_oriented_pairs': 10201.0,
    'pairs_on_different_chromosomes': 18735.0,
    'pairs_with_other_orientation': 1418.0,
    'raw_total_sequences': 34856556.0,
    'reads_MQ0': 21224.0,
    'reads_QC_failed': 0.0,
    'reads_duplicated': 0.0,
    'reads_mapped': 4163663.0,
    'reads_mapped_and_paired': 3555266.0,
    'reads_paired': 34856556.0,
    'reads_properly_paired': 3478458.0,
    'reads_unmapped': 30692893.0,
    'sequences': 34856556.0,
    'total_length': 4996041946.0
},
jfear commented 7 years ago

I also have similar picard and hisat2 output:

'picard_collectrnaseqmetrics': {
    'first': {
        'CODING_BASES': 183959307.0,
        'CORRECT_STRAND_READS': 427095.0,
        'IGNORED_READS': 0.0,
        'INCORRECT_STRAND_READS': 388979.0,
        'INTERGENIC_BASES': 166530170.0,
        'INTRONIC_BASES': 190018903.0,
        'LIBRARY': nan,
        'MEDIAN_3PRIME_BIAS': 0.0,
        'MEDIAN_5PRIME_BIAS': 0.001894,
        'MEDIAN_5PRIME_TO_3PRIME_BIAS': 0.0,
        'MEDIAN_CV_COVERAGE': 2.418909,
        'NUM_R1_TRANSCRIPT_STRAND_READS': 145031.0,
        'NUM_R2_TRANSCRIPT_STRAND_READS': 125440.0,
        'NUM_UNEXPLAINED_READS': 172334.0,
        'PCT_CODING_BASES': 0.31579,
        'PCT_CORRECT_STRAND_READS': 0.523353,
        'PCT_INTERGENIC_BASES': 0.28586999999999996,
        'PCT_INTRONIC_BASES': 0.326192,
        'PCT_MRNA_BASES': 0.387938,
        'PCT_R1_TRANSCRIPT_STRAND_READS': 0.536216,
        'PCT_R2_TRANSCRIPT_STRAND_READS': 0.46378400000000003,
        'PCT_RIBOSOMAL_BASES': nan,
        'PCT_USABLE_BASES': 0.045233999999999996,
        'PCT_UTR_BASES': 0.072149,
        'PF_ALIGNED_BASES': 582537692.0,
        'PF_BASES': 4996041946.0,
        'READ_GROUP': nan,
        'RIBOSOMAL_BASES': nan,
        'SAMPLE': nan,
        'UTR_BASES': 42029312.0
    },

'hisat2': {
    'num_concordant_multimappers': 394703.0,
    'num_concordant_reads_unaligned': 15689049.0,
    'num_concordant_reads_uniquely_aligned': 1344526.0,
    'num_discordant_reads_aligned': 18408.0,
    'num_multimappers': 151066.0,
    'num_reads': 17428278.0,
    'num_reads_paired': 17428278.0,
    'num_unaligned': 30692893.0,
    'num_uniquely_algined': 497323.0,
    'per_alignment': 11.95
},
jfear commented 7 years ago

Looking at these results, I don't think there is any added value in having bam_stat. I am actually a little confused by its output, because only the unaligned counts match with other tools. The only value is would add is the number of spliced reads, which I don't think is needed for this project.