Issue during htseq_cnt job: [E::sam_parse1] hex field does not have an even number of digits

boehmv commented 1 year ago

Hi Isaac, first of all: thanks for providing your cool bakR package, I was just looking for such an approach when I saw your publication in RNA and promptly wanted to check it out.

Disclaimer: I am by far not an expert in bioinformatics or snakemake, so please forgive me if some questions/comments are "stupid".

I have a few minor comments and one major problem right now.:

First of all, when using the provided config.yaml as basis for my own run, I noticed there was a typo at line 101: "flattened" instead of "flattend", which needs to be corrected in order to run properly.
Probably wrongly I assumed that all the information below line 70 in the config.yaml ("##### Parameters that are only relevant if bam2bakr is False #####") are not relevant as long as I provide BAM files, but apparently the htseq_cnt job uses the "flattened" parameter for parsing the provided annotation file? I used the gencode.v42 annotation supplemented with spike-Ins (SIRVomeERCCome) and had to set the flattened parameter to false in order to get meaningful results.
Amateur question: How do I restart only the htseq_cnt job without re-running the sort_filter job?
However, I am currently stuck at the "# Combine outputs from individual jobs" step in htseq.sh, which always throws the error:

[E::sam_parse1] hex field does not have an even number of digits

Consequently, samtools sort stops and no final "results/htseq/{sample}_tl.bam" file is produced. Do you have any idea what the problem could be?

The log file for this job is attached below.

Thanks in advance and best wishes, Volker

Nter_24h_IAA_4SU_2.log

isaacvock commented 1 year ago

Hi Volker,

Thank you very much for your interest in bakR!

I will address your minor comments first. In regards to the error you are running into, it is not one I have personally every seen but I'll do my best to try and help you diagnose the problem.

Thank you for catching that typo!
You have caught bam2bakR at a bit of an awkward transition period as I was rushing to release new functionality yesterday in preparation for our metabolic labeling methodology preprint that was released today. flattened is an experimental parameter that should be set to FALSE for now, and that I will likely remove from the main branch shortly so that I can continue work on it separately. The parameter is to allow alignment to a DEXSeq flattened annotation, which requires a change in HTSeq's behavior downstream of read alignment. It's a bit of a gray area as I consider it a part of the larger "fastq2bakR" pipeline, though it is admittedly confusing that it affects a bam2bakR step.
Correct me if I am wrong, but I assume you are asking because you are using the preferred strategy for running bam2bakR (i.e., with SnakeDeploy) and it is rerunning the sort_filter step everytime you run the pipeline, even if the output of the sort_filter step has not been modified since the last run? In that case, the solution is to add --rerun-triggers mtime to your call to snakemake (e.g., snakemake --cores all --use-conda --rerun-triggers mtime). The details of why you have to do this are discussed here. bam2bakR requires Snakemake to temporarily copy some shell scripts, which newer versions of Snakemake view as a change to the input to steps using such shell scripts. --rerun-triggers mtime makes it so that jobs are only rerun if the output of that job is missing or has been modified since the last run.

In regards to your HTSeq error:

What is the output of samtools view <one of the .s.sam files from sort_filter output> | head -n 10 and head <same .s.sam file> -n 10? I want to make sure that nothing went wrong in the upstream step of bam2bakR.
How did you align your fastq files?
Would you be willing to share a downsampled version of one of your bam files to help me more easily reproduce the error? Perhaps with only 1% of the reads (which can be generated by running samtools view -b -s 0.01 <input bam file>.bam > <subsampled bam name>.bam) or less.

boehmv commented 1 year ago

Thank you very much for the super fast response and for willing to help me troubleshoot, much appreciated!

Concerning all the minor points: makes sense, thanks for the explanation and --rerun-triggers mtime works fine (although it still generates the [E::sam_parse1] problems mentioned above), but it skips now the sort_filter step (yeah!)

What is the output of samtools view <one of the .s.sam files from sort_filter output> | head -n 10 and head <same .s.sam file> -n 10? I want to make sure that nothing went wrong in the upstream step of bam2bakR.

I have run both commands on one of the bam files, please find the output here: samtools_head.txt head_plain.txt

How did you align your fastq files?

I have pasted the command in this text file, I hope the most relevant parameters are understandable: Align.txt

Would you be willing to share a downsampled version of one of your bam files to help me more easily reproduce the error? Perhaps with only 1% of the reads (which can be generated by running samtools view -b -s 0.01 .bam > .bam) or less.

I had to subsample 0.5% of the reads to pass the size limit, please find the bam file here: Subsampled.bam.gz

Once again: thanks for your support

isaacvock commented 1 year ago

No problem, I'm sorry that you have to deal with errors in someone else's tool, I know how annoying that can be.

Thank you for providing all of the requested files, nothing looks out of the ordinary so I will try running your downsampled bam file. Can you also send me the content of your config.yaml file to make sure I specify everything correctly?

boehmv commented 1 year ago

Sure, here you go (just renamed to txt since GitHub would not allow otherwise):

config.yaml.txt

isaacvock commented 1 year ago

Thank you, I was able to reproduce, diagnose, and (I think) fix the error. The fix is now implemented on the main branch, though before running it you will have to add one parameter to your config file: remove_tags: True. Below I will briefly discuss the problem and the current solution. Let me know if this admittedly hacky solution will not be appropriate for your particular use case. Also let me know if the fix doesn't work for you, or if you run into any other problems.

The problem is that HTseq is misprocessing some of the non-standard tags in your bam files, namely the jI and jM tags. I guess this is a common problem, as the STAR manual states: "jM jI attributes require samtools 0.1.18 or later, and were reported to be incompatible with some downstream tools such as Cufflinks."

It might be possible for me to edit the HTSeq python script to get around this issue, but for now I just wrote a python script to remove those tags from bam files, and added the tag removal step as an optional step that can be included by setting the new config parameter remove_tags to True.

boehmv commented 1 year ago

Hi Isaac,

quick feedback: using the remove_tags: True parameter and re-running the pipeline worked without any further problems! Thank you very much for fixing this issue that quickly.

I am currently looking at the output from bakR and although I probably need to read you mansucript more carefully, it looks really promising.

Along those lines: I very much appreciate your efforts to a) create a tool like that and b) for the extensive documentation. I know how much pain that can be, but except for this "tag" problem, everything else was perfect and very nicely documented.

simonlabcode / bam2bakR

Issue during htseq_cnt job: [E::sam_parse1] hex field does not have an even number of digits #3