faricazjj closed this issue 2 years ago
I confirm @dominikburri's "anecdotal" observation that samples seem to map okish to chr1, but on other chromosomes either don't map or totally mismatch. I looked at SRR6795718 and SRR6795713 (both Mayr) and SRR1573494 (NOT Mayr, so the samples don't seem to be at fault).
For the re-mapped files, can we get some mapping statistics and quality reports? Are you using nf-core/rnaseq? It should produce this kind of report, shouldn't it? If not, you might want to consider using a different pre-processing pipeline, as I think it is important (especially for a benchmarking paper) to provide provenance and quality information about the data used. Along the same lines, we need to know exactly which genome version/files have been used.
@ninsch3000 Yes, the problem was with the headers in the bam files. The provenance of the files uploaded to the public facing server is questionable (we have moved between computing resources many times in the past year), so it is more than likely that the wrong set of files was grabbed.
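For anyone who wants to sanity-check headers like this themselves, here is a minimal sketch of one way to do it. It assumes the header text has already been dumped (for example with `samtools view -H`) and that a second header built against the intended reference is available for comparison. The function names and the toy sequence lengths are made up for illustration and are not taken from the actual files.

```python
def reference_sequences(sam_header_text):
    """Return {sequence name: length} for every @SQ line of a SAM/BAM text header."""
    refs = {}
    for line in sam_header_text.splitlines():
        if not line.startswith("@SQ"):
            continue  # skip @HD, @PG, @RG, @CO lines
        # Each field after the record type looks like TAG:VALUE.
        fields = dict(
            field.split(":", 1) for field in line.split("\t")[1:] if ":" in field
        )
        refs[fields["SN"]] = int(fields["LN"])
    return refs


def compare_headers(bam_refs, genome_refs):
    """Report sequences the expected genome lacks, and sequences whose lengths differ."""
    missing = sorted(set(bam_refs) - set(genome_refs))
    length_mismatch = sorted(
        name
        for name in bam_refs
        if name in genome_refs and bam_refs[name] != genome_refs[name]
    )
    return missing, length_mismatch


# Toy headers with made-up lengths (hypothetical, for illustration only).
bam_header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:1000\n"
    "@SQ\tSN:chr2\tLN:900\n"
    "@SQ\tSN:chrX\tLN:500\n"
)
expected_header = "@SQ\tSN:chr1\tLN:1000\n@SQ\tSN:chr2\tLN:800\n"

missing, length_mismatch = compare_headers(
    reference_sequences(bam_header), reference_sequences(expected_header)
)
print("missing from expected genome:", missing)
print("length differs:", length_mismatch)
```

A mismatch in either list would suggest the bam file was produced against a different reference than the one being viewed.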
I have been in the process of remapping everything with nf-core/rnaseq again. I am writing up a good description / documentation of exactly how things are run, etc. and will post a full description with links to the results above soon.
Mapping is complete. I will upload and review the MultiQC results (which nicely collect many QC metrics in one spot) and update the bam files on the public facing server tomorrow.
Awesome! Love the MultiQC reports @mrgazzara ! They're also really useful to expand the data section of the manuscript a little more, as we can nicely see how the different datasets cover different biotypes, genomic origins/read distributions, etc.
Updates on QC:
reverse strandedness, with the suggestion it was forward. My hand checking confirmed that this was correct: this dataset is in fact forward stranded. This dataset has been remapped, the strandedness column in the Sample Summary GSheet has been updated, and the MultiQC results for the correct forward stranded alignments are in the MultiQC results GDrive directory.
forward strandedness, with the suggestion it was unstranded. My checking of this suggests the data is forward stranded and I am unsure why it was flagged. I have documented my checks (running the pipeline with forward, reverse, and unstranded settings, checking alignment biotypes, looking at stranded bigWig tracks, etc.) in a section of the Mapping GDoc. I am leaving it as forward stranded for now. It is possible that the simulated reads were generated in some weird way and perhaps some tools may complain about this, but we will raise that issue later, if necessary.
Final thing to do before closing for now is wait for the uploads to finish.
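As a rough illustration of the kind of strandedness check discussed above, here is a minimal sketch. It assumes you already have the fraction of reads consistent with the sense strand and with the antisense strand (for example from a tool such as RSeQC's `infer_experiment.py`). The 0.7 cutoff and the function names are assumptions for illustration; this is not the logic or threshold that nf-core/rnaseq itself applies.

```python
def classify_strandedness(sense_fraction, antisense_fraction, cutoff=0.7):
    """Classify a library from the fractions of reads on each strand.

    `cutoff` is an illustrative assumption, not a value taken from any pipeline.
    """
    total = sense_fraction + antisense_fraction
    if sense_fraction < 0 or antisense_fraction < 0 or total > 1.0 + 1e-9:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    if sense_fraction >= cutoff:
        return "forward"
    if antisense_fraction >= cutoff:
        return "reverse"
    return "unstranded"


def check_declared(declared, sense_fraction, antisense_fraction, cutoff=0.7):
    """Return (agrees, inferred) so a mismatch with the sample sheet can be flagged."""
    inferred = classify_strandedness(sense_fraction, antisense_fraction, cutoff)
    return inferred == declared, inferred


# Made-up fractions: a sample declared 'reverse' whose reads look forward stranded.
agrees, inferred = check_declared("reverse", 0.95, 0.03)
if not agrees:
    print(f"declared 'reverse' but the reads suggest '{inferred}'")
```

A borderline sample (fractions near the cutoff) can flip category depending on the threshold chosen, which is one reason a flagged sample may still deserve a manual look at stranded tracks.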
Upload is complete. Finalized bam files can be found here: https://majiq.biociphers.org/data/apaeval/bams/ with description of samples in the datasets GSheet RNA-seq tab.
Closing this issue!
Due to questionable provenance of the alignment files uploaded previously to the publicly facing server, we have decided to re-map all RNA-seq samples described in the summary table GSheet.
I have taken notes on the specific steps taken in this final re-alignment using nextflow (v21.10.6.5660, run from the environment created by the apaeval_execution_workflows.yml) and nf-core/rnaseq (v3.8.1), which can be found in this GDoc. Below I briefly describe the exact annotations and options used for alignment.
Annotations / Genome Data used
Running nf-core/rnaseq
The command to run a dataset looks something like the following (for the mayr dataset):
nextflow run nf-core/rnaseq --input /data/apaeval/nf_rnaseq/mayr_samplesheet.csv --outdir /data/apaeval/nf_rnaseq/mayr/ -profile docker --aligner star_salmon --save_reference --gencode --fasta /data/apaeval/genome_data/GRCh38.primary_assembly.genome.fa.gz --gtf /data/apaeval/genome_data/gencode.v38.annotation.gtf.gz --save_trimmed --skip_markduplicates --skip_stringtie --save_unaligned --skip_bbsplit
All commands used to run the various datasets can be found within the GDoc starting here
All dataset sample sheets can be found in this GDrive directory
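For readers who have not seen one, here is a minimal sketch of how a sample sheet for `--input` could be generated. It assumes the four-column layout (`sample`, `fastq_1`, `fastq_2`, `strandedness`) described in the nf-core/rnaseq usage documentation for the 3.x releases. The sample names and FASTQ paths are placeholders, not the actual entries from the sheets linked above.

```python
import csv
import io

COLUMNS = ["sample", "fastq_1", "fastq_2", "strandedness"]
VALID_STRANDEDNESS = {"forward", "reverse", "unstranded"}


def build_samplesheet(rows):
    """Return sample sheet CSV text; leave fastq_2 empty for single-end samples."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        if row["strandedness"] not in VALID_STRANDEDNESS:
            raise ValueError(f"unexpected strandedness: {row['strandedness']!r}")
        writer.writerow(
            {
                "sample": row["sample"],
                "fastq_1": row["fastq_1"],
                "fastq_2": row.get("fastq_2", ""),
                "strandedness": row["strandedness"],
            }
        )
    return buffer.getvalue()


# Placeholder entries (hypothetical names and paths).
sheet = build_samplesheet(
    [
        {
            "sample": "sample_A",
            "fastq_1": "/path/to/sample_A_R1.fastq.gz",
            "strandedness": "forward",
        },
        {
            "sample": "sample_B",
            "fastq_1": "/path/to/sample_B_R1.fastq.gz",
            "fastq_2": "/path/to/sample_B_R2.fastq.gz",
            "strandedness": "reverse",
        },
    ]
)
print(sheet)
```

Keeping the strandedness column under validation like this makes it harder for a typo to slip into the sheet unnoticed.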
Progress on alignments of datasets
Alignment Quality
This should have been looked at before, but I will update this issue with links to the QC and alignment qualities once they are run and I can look through for the most applicable summary / outputs. MultiQC updates for each dataset are available in a shared GDrive directory here.