Closed JOl-lN closed 3 years ago
@john-s-f thanks so much, I've added you to the riboviz developers team so you should be able to create a branch.
In another situation like this, you could fork the repository to your own github, create the branch there, then put in a pull request.
At 9 Sept dev meeting, we suggested that @acope3 could attack this issue, after successfully putting a yeast dataset through.
This would be the first bacterial dataset in example-datasets.
@FlicAnderson before I start working on this, did you manage to make any progress on this issue?
@acope3 no, I don't think I got started, only as far as downloading the files in preparation for starting but got sidetracked by other issues.
Alright, I think I figured out part of the issue that @john-s-f was running into. For those of you who haven't looked at the link he posted, he was receiving the error "Error in stop_codon_loc:(stop_codon_loc + 2) : argument of length 0" when running bam_to_h5.R. The issue appears to be due to the structure of the GFF file he was using and the index being built by hisat2. I believe hisat2 expects the sequence ids (often chromosome ids) to be in the first column of the the GFF file. However, @john-s-f's file just has REL606 (the strain of the bacteria). This is in contrast to the riboviz style GFF files, which have the ids in this first column. When we try to map the reads to individual nucleotides in bam_to_h5.R, this results in the variable outputList
that looks like this for each item in the list:
[1] "Error in .io_bam(.scan_bamfile, file, reverseComplement, yieldSize(file), : \n seqlevels(param) not in BAM header:\n seqlevels: ‘REL606’\n file: /data/cope/ecoli/output/AraM1/AraM1.bam\n index: /data/cope/ecoli/output/AraM1/AraM1.bam\n" attr(,"class") [1] "try-error" attr(,"condition") <simpleError in .io_bam(.scan_bamfile, file, reverseComplement, yieldSize(file), tmpl, obeyQname(file), asMates(file), qnamePrefix, qnameSuffix, param = param): seqlevels(param) not in BAM header: seqlevels: ‘REL606’ file: /data/cope/ecoli/output/AraM1/AraM1.bam index: /data/cope/ecoli/output/AraM1/AraM1.bam>
Note that seqlevels only has REL606.
I will also note that while trying to figure out what is going on, some ids in the GFF file appear more than once. For example, "Name" might appear twice for the same CDS, like "ID=ECB_00001.p01;Name=ECB_00001.p01;Parent=ECB_00001.t01;Name=thrL". This could lead to other problems down the road.
We discussed this at 30th September meeting. The problem is probably that @acope3 is running with a genome fasta and genome-centric .gff, instead of the (approximate) transcriptome .fasta and transcript-centric .gff.
Currently we have a checking script, check_fasta_gff.py. If your fasta and gff file pass that script, then they should work in riboviz. @mikej888 is developing and testing that script in riboviz#74, and any feedback on that is helpful.
Earlier versions of a script for creating transcript fasta/gff are in script_for_transcript_annotation.Rmd. @mikej888 is working on a python-implemented improvement to that, after the checking script which will test its output.
We discussed this at 30th September meeting. The problem is probably that @acope3 is running with a genome fasta and genome-centric .gff, instead of the (approximate) transcriptome .fasta and transcript-centric .gff.
I believe I was incorrect/unclear on my previous comment and during the meeting today. The fasta file I used is transcriptome-based, which I take to mean each sequence represents a CDS. However, the GFF file provided which resulted in the error is not like Riboviz-style GFF file with the CDS id in the first column. Instead, the first column of the GFF simply uses "REL606," with the CDS id used in the FASTA file found in another column. I think the issue is actually in the following line from bam_to_h5.R:
geneseqname <- as.character(seqnames(gene_location)[1])
Given this GFF file, this is setting the geneseqname to the value found in the first column (always "REL606"), which is obviously not in the .bam file from the mapping to the transcriptome-centric fasta file. This is leading to scanBam errors.
Error in .io_bam(.scan_bamfile, file, reverseComplement, yieldSize(file), : seqlevels(param) not in BAM header: seqlevels: ‘REL606’ file: /data/cope/ecoli/output/AraM1/AraM1.bam index: /data/cope/ecoli/output/AraM1/AraM1.bam Calls: mclapply ... scanBam -> scanBam -> scanBam -> scanBam -> .io_bam
The hard-coded value of 1 for the index seems to assume a Riboviz-style GFF. Perhaps this was accidentally left over when someone was testing this code?
Argh! We've been avoiding bam_to_h5.R
and dealing with everything else, for the last year, see riboviz#23.. So this may take some thought to fix.
This issue may be related to riboviz#171.
Instead of using a numeric index, could we allow the user to specify the id to pull from the gff file in the case of a non-riboviz style gff? For example, the CDS file I'm using for this dataset uses the locus_id for each CDS. My opinion is this would give the user the greatest flexibility and this would require relatively few changes to bam_to_h5.R (I think). Does this seem reasonable to everyone else? Am I missing something?
Regarding the mapping of bacterial Ribo-Seq datasets, Green and Buskirk's work suggests that mapping to the 3'-end is preferred. However, my current understanding of the code is that all reads are assigned to the nucleotide at the 5'-end of the read and the A-site displacement is then applied by adding some number of nucleotides to the 5'-end. I think a simple solution for calculating the A-site position for bacterial riboseq datasets without changing how the mapping is done is to modify CalcAsiteFixedOneLength in read_counts_functions.R to check if the value for asite_displacement is negative and, if so, determine the A-site by shifting by the sum of the read_length and the asite_displacement. Does that make sense and what does everyone think? This seems like a more straightforward approach than changing how the mapping is done, but that does not mean it is the best solution.
There's a workaround solution to 3'-end mapping that doesn't require us to rewrite code. Basically we just hard-code the 3' displacement in the read length mapping file, for example if the A-site were 10nt from the 3'-end then:
read_length asite_disp
20 10
21 11
22 12
23 13
24 14
25 15
26 16
27 17
28 18
29 19
20 20
I didn't find the exact distance from 3'end to A-site explicitly stated on a quick re-read of Allen Buskirk's paper A systematically-revised ribosome profiling method for bacteria. Maybe it's in an earlier paper? And this is a place where it would be really easy to make off-by-one errors. But I think the approach should work as long as we get the numbers right.
That's interesting, if Alex has sorted out why my e coli data wasn't working and we can do 3' mapping then I'd definitely want to try that with my bacterial data. The github repo for that paper has this python notebook which under the Analysis Settings chunk has a line settings['shift'] = 11 # A-site shift
which makes it seem like they're only doing a shift of 11.
@ewallace I recall their paper suggests a 12 nucleotide shift from the 3' end, but I'm not sure if this is including or excluding the base at the 3'-end. If my memory is correct, then the python notebook found by @john-s-f might indicate the 12 nucleotide shift actually includes the nucleotide at the 3'-end.
@john-s-f informed me that this dataset is currently unpublished (although it will be submitted soon). Should we hold off on adding it to example-datasets until the data is publicly available? I can work with another E. coli dataset in the meantime.
Either is fine with me. It's not a big problem if we temporarily have an unpublished dataset in example-datasets
repository, we should just indicate in the config.yaml
that it's unpublished as yet.
Getting one of the Allen Buskirk E. coli datasets in would be amazing.
Submitted pull request #25.
Request @acope3 checks the fasta/gff files here with check_fasta_gff.py
, just so we know.
The latest version of riboviz.tools_check_fasta_gff
is in the gff-spec-74
branch.
I ran the latest version of check_fasta_gff.py
for the FASTA and GFF files I used to successfully complete this run. It ran to completion, but it output that each CDS doesn't start with ATG. This is because the CDS file I got from John has 25 nucleotides appended to represent the 5'-UTR. I guess the question is do we want check_fasta_gff.py
to assume UTRs are included in the FASTA file or not? We could also make it flexible enough to handle all of these situations if that's worth the effort.
Hi Alex,
The flexibility is in the gff file. If you have the CDS annotated starting at nt26 of the fasta, then check_fasta_gff.py
will detect the start codon in the place you want it. That also propagates the information to the counts later on. So a different-length UTR is specified by the CDS co-ordinates in the .gff
I don't think we have this documented clearly, that's #74. We mention it in run-vignette.md, but not very very clearly. It also needs a picture TBH.
Could you post the first ~10 lines of your .gff and .fasta file here so we can take a look? And let's discuss at tomorrow's dev meeting.
Gah, I really shouldn't comment before I've had my morning coffee. There was a large amount of output which at a quick glance made me think there was some issue with the code. Not true. As far as I can tell, the code is working as expected. The real issue is E. coli uses codons GTG (valine) and TTG (leucine) as start codons fairly frequently. When the code translates the start codon to its amino acid, it thinks the CDS is starting with a V or L.
And let's discuss at tomorrow's dev meeting.
Noted.
check_fasta_gff
needs extended to allow non-canonical start codons as arguments (e.g. GTG or TTG (V or L after translation). It would make the code clearer too if we checked for start codons directly rather than their translations. Additional functionality for riboviz/74.
Can soon upload to example datasets, @acope3 proposed at the call.
@shahpr we think we can talk to Allen Buskirk about this; Alex thinks we may be off by one, so this could be a good starting point for the conversation.
I don't think I have permissions to create new branches. For now, I've placed all the things @FlicAnderson theoretically needs to recreate the issues I'm having https://rutgers.box.com/s/3d0se6r1g48tb35o3xc0l1mh3rhttu51. There's a description of the files in there. Once I have permissions and this data works properly I'll add it in as a new branch.