Solve issue with and add e. coli data

JOl-lN commented 4 years ago

I don't think I have permissions to create new branches. For now, I've placed all the things @FlicAnderson theoretically needs to recreate the issues I'm having https://rutgers.box.com/s/3d0se6r1g48tb35o3xc0l1mh3rhttu51. There's a description of the files in there. Once I have permissions and this data works properly I'll add it in as a new branch.

ewallace commented 4 years ago

@john-s-f thanks so much, I've added you to the riboviz developers team so you should be able to create a branch.

In another situation like this, you could fork the repository to your own github, create the branch there, then put in a pull request.

ewallace commented 4 years ago

At 9 Sept dev meeting, we suggested that @acope3 could attack this issue, after successfully putting a yeast dataset through.

This would be the first bacterial dataset in example-datasets.

acope3 commented 4 years ago

@FlicAnderson before I start working on this, did you manage to make any progress on this issue?

FlicAnderson commented 4 years ago

@acope3 no, I don't think I got started, only as far as downloading the files in preparation for starting but got sidetracked by other issues.

acope3 commented 4 years ago

Alright, I think I figured out part of the issue that @john-s-f was running into. For those of you who haven't looked at the link he posted, he was receiving the error "Error in stop_codon_loc:(stop_codon_loc + 2) : argument of length 0" when running bam_to_h5.R. The issue appears to be due to the structure of the GFF file he was using and the index being built by hisat2. I believe hisat2 expects the sequence ids (often chromosome ids) to be in the first column of the the GFF file. However, @john-s-f's file just has REL606 (the strain of the bacteria). This is in contrast to the riboviz style GFF files, which have the ids in this first column. When we try to map the reads to individual nucleotides in bam_to_h5.R, this results in the variable outputList that looks like this for each item in the list:

[1] "Error in .io_bam(.scan_bamfile, file, reverseComplement, yieldSize(file), : \n seqlevels(param) not in BAM header:\n seqlevels: ‘REL606’\n file: /data/cope/ecoli/output/AraM1/AraM1.bam\n index: /data/cope/ecoli/output/AraM1/AraM1.bam\n" attr(,"class") [1] "try-error" attr(,"condition") <simpleError in .io_bam(.scan_bamfile, file, reverseComplement, yieldSize(file), tmpl, obeyQname(file), asMates(file), qnamePrefix, qnameSuffix, param = param): seqlevels(param) not in BAM header: seqlevels: ‘REL606’ file: /data/cope/ecoli/output/AraM1/AraM1.bam index: /data/cope/ecoli/output/AraM1/AraM1.bam>

Note that seqlevels only has REL606.

I will also note that while trying to figure out what is going on, some ids in the GFF file appear more than once. For example, "Name" might appear twice for the same CDS, like "ID=ECB_00001.p01;Name=ECB_00001.p01;Parent=ECB_00001.t01;Name=thrL". This could lead to other problems down the road.

ewallace commented 4 years ago

We discussed this at 30th September meeting. The problem is probably that @acope3 is running with a genome fasta and genome-centric .gff, instead of the (approximate) transcriptome .fasta and transcript-centric .gff.

Currently we have a checking script, check_fasta_gff.py. If your fasta and gff file pass that script, then they should work in riboviz. @mikej888 is developing and testing that script in riboviz#74, and any feedback on that is helpful.

Earlier versions of a script for creating transcript fasta/gff are in script_for_transcript_annotation.Rmd. @mikej888 is working on a python-implemented improvement to that, after the checking script which will test its output.

acope3 commented 4 years ago

We discussed this at 30th September meeting. The problem is probably that @acope3 is running with a genome fasta and genome-centric .gff, instead of the (approximate) transcriptome .fasta and transcript-centric .gff.

I believe I was incorrect/unclear on my previous comment and during the meeting today. The fasta file I used is transcriptome-based, which I take to mean each sequence represents a CDS. However, the GFF file provided which resulted in the error is not like Riboviz-style GFF file with the CDS id in the first column. Instead, the first column of the GFF simply uses "REL606," with the CDS id used in the FASTA file found in another column. I think the issue is actually in the following line from bam_to_h5.R:

geneseqname <- as.character(seqnames(gene_location)[1])

Given this GFF file, this is setting the geneseqname to the value found in the first column (always "REL606"), which is obviously not in the .bam file from the mapping to the transcriptome-centric fasta file. This is leading to scanBam errors.

Error in .io_bam(.scan_bamfile, file, reverseComplement, yieldSize(file), : seqlevels(param) not in BAM header: seqlevels: ‘REL606’ file: /data/cope/ecoli/output/AraM1/AraM1.bam index: /data/cope/ecoli/output/AraM1/AraM1.bam Calls: mclapply ... scanBam -> scanBam -> scanBam -> scanBam -> .io_bam

The hard-coded value of 1 for the index seems to assume a Riboviz-style GFF. Perhaps this was accidentally left over when someone was testing this code?

ewallace commented 4 years ago

Argh! We've been avoiding bam_to_h5.R and dealing with everything else, for the last year, see riboviz#23.. So this may take some thought to fix.

This issue may be related to riboviz#171.

acope3 commented 3 years ago

Instead of using a numeric index, could we allow the user to specify the id to pull from the gff file in the case of a non-riboviz style gff? For example, the CDS file I'm using for this dataset uses the locus_id for each CDS. My opinion is this would give the user the greatest flexibility and this would require relatively few changes to bam_to_h5.R (I think). Does this seem reasonable to everyone else? Am I missing something?

acope3 commented 3 years ago

Regarding the mapping of bacterial Ribo-Seq datasets, Green and Buskirk's work suggests that mapping to the 3'-end is preferred. However, my current understanding of the code is that all reads are assigned to the nucleotide at the 5'-end of the read and the A-site displacement is then applied by adding some number of nucleotides to the 5'-end. I think a simple solution for calculating the A-site position for bacterial riboseq datasets without changing how the mapping is done is to modify CalcAsiteFixedOneLength in read_counts_functions.R to check if the value for asite_displacement is negative and, if so, determine the A-site by shifting by the sum of the read_length and the asite_displacement. Does that make sense and what does everyone think? This seems like a more straightforward approach than changing how the mapping is done, but that does not mean it is the best solution.

ewallace commented 3 years ago

There's a workaround solution to 3'-end mapping that doesn't require us to rewrite code. Basically we just hard-code the 3' displacement in the read length mapping file, for example if the A-site were 10nt from the 3'-end then:

read_length asite_disp
20  10
21  11
22  12
23  13
24  14
25  15
26  16
27  17
28  18
29  19
20  20

I didn't find the exact distance from 3'end to A-site explicitly stated on a quick re-read of Allen Buskirk's paper A systematically-revised ribosome profiling method for bacteria. Maybe it's in an earlier paper? And this is a place where it would be really easy to make off-by-one errors. But I think the approach should work as long as we get the numbers right.

JOl-lN commented 3 years ago

That's interesting, if Alex has sorted out why my e coli data wasn't working and we can do 3' mapping then I'd definitely want to try that with my bacterial data. The github repo for that paper has this python notebook which under the Analysis Settings chunk has a line settings['shift'] = 11 # A-site shift which makes it seem like they're only doing a shift of 11.

acope3 commented 3 years ago

@ewallace I recall their paper suggests a 12 nucleotide shift from the 3' end, but I'm not sure if this is including or excluding the base at the 3'-end. If my memory is correct, then the python notebook found by @john-s-f might indicate the 12 nucleotide shift actually includes the nucleotide at the 3'-end.

acope3 commented 3 years ago

@john-s-f informed me that this dataset is currently unpublished (although it will be submitted soon). Should we hold off on adding it to example-datasets until the data is publicly available? I can work with another E. coli dataset in the meantime.

ewallace commented 3 years ago

Either is fine with me. It's not a big problem if we temporarily have an unpublished dataset in example-datasets repository, we should just indicate in the config.yaml that it's unpublished as yet.

Getting one of the Allen Buskirk E. coli datasets in would be amazing.

acope3 commented 3 years ago

Submitted pull request #25.

ewallace commented 3 years ago

Request @acope3 checks the fasta/gff files here with check_fasta_gff.py, just so we know.

mikej888 commented 3 years ago

The latest version of riboviz.tools_check_fasta_gff is in the gff-spec-74 branch.

acope3 commented 3 years ago

I ran the latest version of check_fasta_gff.py for the FASTA and GFF files I used to successfully complete this run. It ran to completion, but it output that each CDS doesn't start with ATG. This is because the CDS file I got from John has 25 nucleotides appended to represent the 5'-UTR. I guess the question is do we want check_fasta_gff.py to assume UTRs are included in the FASTA file or not? We could also make it flexible enough to handle all of these situations if that's worth the effort.

ewallace commented 3 years ago

Hi Alex, The flexibility is in the gff file. If you have the CDS annotated starting at nt26 of the fasta, then check_fasta_gff.py will detect the start codon in the place you want it. That also propagates the information to the counts later on. So a different-length UTR is specified by the CDS co-ordinates in the .gff

I don't think we have this documented clearly, that's #74. We mention it in run-vignette.md, but not very very clearly. It also needs a picture TBH.

Could you post the first ~10 lines of your .gff and .fasta file here so we can take a look? And let's discuss at tomorrow's dev meeting.

acope3 commented 3 years ago

Gah, I really shouldn't comment before I've had my morning coffee. There was a large amount of output which at a quick glance made me think there was some issue with the code. Not true. As far as I can tell, the code is working as expected. The real issue is E. coli uses codons GTG (valine) and TTG (leucine) as start codons fairly frequently. When the code translates the start codon to its amino acid, it thinks the CDS is starting with a V or L.

kavousan commented 3 years ago

And let's discuss at tomorrow's dev meeting.

Noted.

mikej888 commented 3 years ago

check_fasta_gff needs extended to allow non-canonical start codons as arguments (e.g. GTG or TTG (V or L after translation). It would make the code clearer too if we checked for start codons directly rather than their translations. Additional functionality for riboviz/74.

kavousan commented 3 years ago

Can soon upload to example datasets, @acope3 proposed at the call.

@shahpr we think we can talk to Allen Buskirk about this; Alex thinks we may be off by one, so this could be a good starting point for the conversation.

riboviz / example-datasets

Solve issue with and add e. coli data #12