riboviz / riboviz

Analysis and Visualization of Ribosome Profiling Data
https://riboviz.org
Apache License 2.0
29 stars 11 forks source link

ORF finder? #92

Open ewallace opened 4 years ago

ewallace commented 4 years ago

We need to have an ORF finder for the transcriptome that is flexible to non-AUG start codons. Ideally this would also output ORFs in a gff format testable by extending check_gff_fasta.py

Similar or identical functionality is widely available, for example:

We should have one that is easy to use and slots into the pipeline, with standard/well-defined/tested data formats.

@shahpr do you have some code handy, or a recommendation?

Related to:

kavousan commented 4 years ago

If there was a code, @ewallace would use it straight away.

@shahpr does have some code, but not of production quality/of sharable quality.

No obvious assignee; backlog.

ewallace commented 4 years ago

@ewallace opened branch orf-finder-92, containing test files for input and output of orf-finder

>data 
  > orf_finder_test
    > uORFtest_yeast_4CDS_w_250utrs.fa
    > uORFtest_yeast_4CDS_w_250utrs.gff3
    > uORFtestresults_ATGGTGTTG_minLength10_yeast_4CDS_w_250utrs.gff3
    > uORFtestresults_ATGonly_minLength20_yeast_4CDS_w_250utrs.gff3

This tests functionality for a future script, find_uORFs, that would find upstream ORFs based on an existing main ORF / CDS annotation.

ewallace commented 4 years ago

This relates to to @XUEXUEXUE0's project and we should come back to it once we can run riboviz on the cluster.

XUEXUEXUE0 commented 4 years ago

I think I'm almost finished this pipeline but I am not sure that which command line parameters we need. I added a few simple parameter. Users can use python uORF_finder.py --fasta uORFtest_yeast_4CDS_w_250utrs.fa --gff3 uORFtest_yeast_4CDS_w_250utrs.gff3 --output test2.gff3 --min_length 10

The ouput file and code are on https://github.com/riboviz/riboviz/tree/orf-finder-92/data/orf_finder_test

Do I need to add more parameters in command line? What else needs to be improved in the output file?

FlicAnderson commented 4 years ago

One note for when this script is complete, it's maybe worth moving it into the folder with the rest of the python scripts and tools? Currently it's in the data folder, which I wouldn't expect, and moving it to somewhere like riboviz/riboviz/tools/ would make more sense I think.

ewallace commented 4 years ago

This is great progress! Following today's dev meeting, and a quick review of the code, some suggestions as to how to make further progress:

ewallace commented 4 years ago

Would it be helpful to have a smaller function that just finds the next position of a codon from a list?

For example: find_next_string_pos(string_vector, start_pos, target_list)

With example: find_next_string_pos( ["ATG", "AAA", "TGA"], 1, ["AAA"]) = 2

Then this could be used to find the next in-frame start codon, with input target_list = ["ATG"], or the next in-frame stop codon, with input target_list = ["TGA", "TAG", "TAA"]. To find an ORF from a vector of codons, the stop codon would have to be searched for starting at the position one after that returned by the previous start codon.

(this might already be some kind of regex functionality)

mikej888 commented 4 years ago

@XUEXUEXUE0 ...

You'll find it far easier to write unit tests for your other parts of your script if you refactor your script to ensure that all its functionality is wrapped within functions. These functions can then each have isolated tests for them. An initial refactoring would be into something like:

The script should also include a condition to check if the script is being run as a command-line script or is being imported. If it is being run as a command-line script then it calls invoke_find_orfs:

if __name__ == "__main__":
    invoke_find_orfs()

This style of design means that any test script can import uORF_finder.py and test its functions without the script itself being run as if it was a command-line function. For an example of this style of design see:

You wrote that:

My initial idea is to create simple input .fasta file and .gff3 first. Then create a test output .gff3 file. Finally check if the output .gff3 file we create is equal to the .gff3 file produced by the script using assert statement. Is my idea correct and reasonable?

This could be the first test function written for your script. A test function would invoke find_orfs, load output_file, load your test .gff3 file then compare their contents. As .gff3 files are tab-separated values files you can load these as Pandas data frames then compare them using pandas.DataFrame.equals. For example:

import pandas as pd

assert data1.equals(data2)

An approach more in the spirit of unit testing is to use pandas.testing.assert_frame_equal. For example:

from pandas.testing import assert_frame_equal

assert_frame_equal(data1, data2)

As an example:

>>> import pandas as pd
>>> from pandas.testing import assert_frame_equal
>>> data1 = pd.read_csv("data/yeast_CDS_w_250utrs.gff3", sep="\t", comment="#")
>>> data2 = pd.read_csv("data/yeast_CDS_w_250utrs.gff3", sep="\t", comment="#")

These are the same file so the contents are expected to be equal:

>>> assert_frame_equal(data1, data2)

There is no exception so that are equal.

Now load a file we know to be different and try again:

>>> data3 = pd.read_csv("data/yeast_features.tsv", sep="\t", comment="#")
>>> assert_frame_equal(data1, data3)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Anaconda3\lib\site-packages\pandas\util\testing.py", line 1458, in assert_frame_equal
    '{shape!r}'.format(shape=right.shape))
  File "C:\Anaconda3\lib\site-packages\pandas\util\testing.py", line 1094, in raise_assert_detail
    raise AssertionError(msg)
AssertionError: DataFrame are different

DataFrame shape mismatch
[left]:  (17435, 9)
[right]: (5293, 8)

This is what we expect. If this happened within a pytest function then the test would fail as we would want it to.

This provides you with a simple regression test.

You could then refactor find_orfs into additional functions. For example:

find_orfs can then be refactored to call create_gff_db, create_position_dict, process_fasta_records. As its arguments would not change, the regression test you have already written could be used to test that your refactoring did not introduce any bugs.

create_position_dict can be tested using one or more input GFF files with records that would both result in entries being added to positions_dict (e.g. a sequence with both a UTR5 and CDS) and records that would not (e.g. a sequence with an UTR5 only, a sequence with a CDS only). For each test, you can check that and positions_dict contains entries for the sequences with the expected features and that the positions have the expected values.

Test functions for getORF can give getORF values for its arguments and then check that the dictionary of lists it returns has the expected values. An initial set of tests could be:

As Edward suggested, the sequences can be short and you can make them up such that they represent the cases you want to test.

Each test function checks that the values in the list (frame index, length, start position, end position, type, name) are as expected. For the last option above no values would be expected.

The contents of the code.docx file you emailed could be added as comments to the code to document what functions do, and their argument and return types. See https://github.com/riboviz/riboviz/blob/master/riboviz/demultiplex_fastq.py for examples of a format to use.

ewallace commented 4 years ago

I agree. The crucial functionality is in create_position_dict for finding the actual ORFs. The rest is input/output handling. The functionality to find ORFs in a sequence (isn't that create_position_dict?) is independent of the functionality to filter and annotate them based on position relative to the CDS. So I'd suggest, first find ORFs, second classify as upstream/overlapping/NTE/whatever.

XUEXUEXUE0 commented 4 years ago

@mikej888 @ewallace Thank you!

Commit bc2c2a5 refactor the original code. The new script fixes the bug that could generate inframe stop codons and add the function which find the longest orf when the orf have the same stop codon but different start codon. The output of the new script is equal to the test which made by Edward now.

To do:

XUEXUEXUE0 commented 4 years ago

Here is my flowcharts for the whole script and the core function of the script. Hope these make my code easier to understand

Untitled Diagram

corefunction

ewallace commented 4 years ago

Great start! For the overall flow for the "whole script", I would include "find out longest ORFs" in the "identify all ORFs" step.

mikej888 commented 4 years ago

Running the current code...

$ python riboviz/tools/uorf_finder.py --fasta data/orf_finder_test/uORFtest_yeast_4CDS_w_250utrs.fa \
  --gff3 data/orf_finder_test/uORFtest_yeast_4CDS_w_250utrs.gff3 \
  --output output_atg20.gff3 --new_fasta output_atg20.fasta \
  --min_length 20
ModuleNotFoundError: No module named 'BCBio'

bcbio-gff

$ pip install bcbio-gff
Successfully installed bcbio-gff-0.6.6

Use Edward's inputs:

uORFtest_yeast_4CDS_w_250utrs.fa
uORFtest_yeast_4CDS_w_250utrs.gff3

Try to replicate Edward's output uORFtestresults_ATGonly_minLength20_yeast_4CDS_w_250utrs.gff3:

$ python riboviz/tools/uorf_finder.py --fasta data/orf_finder_test/uORFtest_yeast_4CDS_w_250utrs.fa \
  --gff3 data/orf_finder_test/uORFtest_yeast_4CDS_w_250utrs.gff3 \
  --output output_atg20.gff3 --new_fasta output_atg20.fasta \
  --min_length 20

$ cat output_atg20.gff3 
YAL038W riboviz uORF    109 183 .   +   .   Name=YAL038W_uORF_109;frame=1;start_codon=ATG
YOR303W riboviz uORF    117 194 .   +   .   Name=YOR303W_uORF_117;frame=3;start_codon=ATG
YOR335C riboviz uORF    41  169 .   +   .   Name=YOR335C_uORF_41;frame=2;start_codon=ATG

$ cat data/orf_finder_test/uORFtestresults_ATGonly_minLength20_yeast_4CDS_w_250utrs.gff3 
##gff-version 3
##source-ew-manual Edward Wallace made this by hand
##to test upstream ORF finders with ATG starts and 20-codon minimum
## find_uORFs --start_codons ATG --min_length 20 --fasta uORFtest_yeast_4CDS_w_250utrs.fa --gff uORFtest_yeast_4CDS_w_250utrs.gff3
##date 2019-31-10
YAL038W ew-manual   uORF    109 183 .   +   .   Name=YAL038W_uORF_109;start_codon=ATG;
YOR303W ew-manual   uORF    117 194 .   +   .   Name=YOR303W_uORF_117;start_codon=ATG;
YOR335C ew-manual   uORF    41  169 .   +   .   Name=YOR335C_uORF_41;start_codon=ATG;

Compare to Siyin's uORFoutput_ATGonly_minLength20_yeast_4CDS_w_250utrs.gff3:

$ cat data/orf_finder_test/uORFoutput_ATGonly_minLength20_yeast_4CDS_w_250utrs.gff3 
YAL038W riboviz uORF    109 183 .   +   .   Name=YAL038W_uORF_109;frame=0;start_codon=ATG
YOR303W riboviz uORF    117 194 .   +   .   Name=YOR303W_uORF_117;frame=2;start_codon=ATG
YOR335C riboviz uORF    41  169 .   +   .   Name=YOR335C_uORF_41;frame=1;start_codon=ATG

frame values differ (1-indexed vs. 0-indexed). @ewallace Which should it be?

Check output FASTA file:

$ cat output_atg20.fasta 
>YAL005C <unknown description>
TCCATCGGCGGCAAAAGGGAGAGAAAGAACCCAAAAAGAAGGGGGGCCATTTAGATTAGC
...
CGTTGAAGAAGTTGATTAA
>YAL038W <unknown description>
AAAAAAAAGGCTCGCCATCAAAACGATATTCGTTGGCTTTTTTTTCTGAATTATAAATAC
...
CTCTACCGTTTAA
>YOR303W <unknown description>
GAACTATAACACATAGATACACTCGAACATCTAATTGTTTAAATACTGCAAAGAATACAA
...
GAGTATAAATGTTACTAAGTTGGCCAAGGAAAGAGTGTTGTTCTAA
>YOR335C <unknown description>
GGCCGTTTCTTTTTTCTCTTTTTTTTCATCCGCTCGATGGATGATGAGTAAAACAAGAAA
...
CATTTAA

@ewallace Remove <unknown description> or replace with...?

Tthe output FASTA file has records corresponding to all the sequences from the input FASTA file:

$ grep \> data/orf_finder_test/uORFtest_yeast_4CDS_w_250utrs.fa 
>YAL005C
>YAL038W
>YOR303W
>YOR335C

@ewallace Should YAL005C be there? It is not in the output GFF file.

Running with --min_length 10 (instead of 20) gives produces an identical FASTA file as above:

$ python riboviz/tools/uorf_finder.py --fasta data/orf_finder_test/uORFtest_yeast_4CDS_w_250utrs.fa \
  --gff3 data/orf_finder_test/uORFtest_yeast_4CDS_w_250utrs.gff3 \
  --output output_atg10.gff3 --new_fasta output_atg10.fasta --min_length 10

$ cat output_atg10.gff3
YAL005C riboviz uORF    103 132 .   +   .   Name=YAL005C_uORF_103;frame=1;start_codon=ATG
YAL038W riboviz uORF    109 183 .   +   .   Name=YAL038W_uORF_109;frame=1;start_codon=ATG
YOR303W riboviz uORF    117 194 .   +   .   Name=YOR303W_uORF_117;frame=3;start_codon=ATG
YOR335C riboviz uORF    41  169 .   +   .   Name=YOR335C_uORF_41;frame=2;start_codon=ATG

$ cat output_atg10.fasta
>YAL005C <unknown description>
TCCATCGGCGGCAAAAGGGAGAGAAAGAACCCAAAAAGAAGGGGGGCCATTTAGATTAGC
...
CGTTGAAGAAGTTGATTAA
>YAL038W <unknown description>
AAAAAAAAGGCTCGCCATCAAAACGATATTCGTTGGCTTTTTTTTCTGAATTATAAATAC
...
CTCTACCGTTTAA
>YOR303W <unknown description>
GAACTATAACACATAGATACACTCGAACATCTAATTGTTTAAATACTGCAAAGAATACAA
...
GAGTATAAATGTTACTAAGTTGGCCAAGGAAAGAGTGTTGTTCTAA
>YOR335C <unknown description>
GGCCGTTTCTTTTTTCTCTTTTTTTTCATCCGCTCGATGGATGATGAGTAAAACAAGAAA
...
CATTTAA

$ cmp output_atg10.fasta  output_atg20.fasta 
$ echo $?
0

Try to replicate Edward's output uORFtestresults_ATGGTGTTG_minLength10_yeast_4CDS_w_250utrs.gff3:

$ python riboviz/tools/uorf_finder.py --fasta data/orf_finder_test/uORFtest_yeast_4CDS_w_250utrs.fa \
    --gff3 data/orf_finder_test/uORFtest_yeast_4CDS_w_250utrs.gff3 \
    --output output_atggtgttg_10.gff3 --new_fasta output_atggtgttg_10.fasta \
    --min_length 10 --start_codons ATG GTG TTG
$ cat output_atggtgttg_10.gff3 
YAL005C riboviz uORF    100 132 .   +   .   Name=YAL005C_uORF_100;frame=1;start_codon=TTG
YAL005C riboviz uORF    122 181 .   +   .   Name=YAL005C_uORF_122;frame=2;start_codon=TTG
YAL038W riboviz uORF    64  183 .   +   .   Name=YAL038W_uORF_64;frame=1;start_codon=TTG
YAL038W riboviz overlap_uORF    202 258 .   +   .   Name=YAL038W_overlap_uORF_202;frame=1;start_codon=TTG
YAL038W riboviz uORF    185 238 .   +   .   Name=YAL038W_uORF_185;frame=2;start_codon=TTG
YOR303W riboviz uORF    35  124 .   +   .   Name=YOR303W_uORF_35;frame=2;start_codon=TTG
YOR303W riboviz CDS_NTE 200 1486    .   +   .   Name=YOR303W_CDS_NTE_200;frame=2;start_codon=TTG
YOR303W riboviz uORF    117 194 .   +   .   Name=YOR303W_uORF_117;frame=3;start_codon=ATG
YOR335C riboviz uORF    145 243 .   +   .   Name=YOR335C_uORF_145;frame=1;start_codon=GTG
YOR335C riboviz uORF    41  169 .   +   .   Name=YOR335C_uORF_41;frame=2;start_codon=ATG
YOR335C riboviz CDS_NTE 203 3127    .   +   .   Name=YOR335C_CDS_NTE_203;frame=2;start_codon=TTG

$ cat data/orf_finder_test/uORFtestresults_ATGGTGTTG_minLength10_yeast_4CDS_w_250utrs.gff3 
##gff-version 3
##source-ew-manual Edward Wallace made this by hand
##to test upstream ORF finders with ATG,GTG,TTG starts and 10-codon minimum
## find_uORFs --start_codons ATG,GTG,TTG --min_length 10 --Ntermexts TRUE --fasta uORFtest_yeast_4CDS_w_250utrs.fa --gff uORFtest_yeast_4CDS_w_250utrs.gff3
##date 2019-31-10
YAL005C ew-manual   uORF    100 132 .   +   .   Name=YAL005C_uORF_100;start_codon=TTG;
YAL005C ew-manual   uORF    122 181 .   +   .   Name=YAL005C_uORF_122;start_codon=TTG;
YAL038W ew-manual   uORF    64  183 .   +   .   Name=YAL038W_uORF_64;start_codon=TTG;
YAL038W ew-manual   uORF    185 238 .   +   .   Name=YAL038W_uORF_185;start_codon=TTG;
YAL038W ew-manual   overlap_uORF    202 258 .   +   .   Name=YAL038W_uORF_202;start_codon=TTG;
YOR303W ew-manual   uORF    35  124 .   +   .   Name=YOR303W_uORF_35;start_codon=TTG;
YOR303W ew-manual   uORF    117 194 .   +   .   Name=YOR303W_uORF_117;start_codon=ATG;
YOR303W ew-manual   CDS_NTE 200 250 .   +   .   Name=YOR303W_NTE_200;start_codon=TTG;
YOR335C ew-manual   uORF    41  169 .   +   .   Name=YOR335C_uORF_41;start_codon=ATG;
YOR335C ew-manual   uORF    145 243 .   +   .   Name=YOR335C_uORF_145;start_codon=GTG;
YOR335C ew-manual   CDS_NTE 203 250 .   +   .   Name=YOR335C_NTE_145;start_codon=TTG;

Order is different and there are two lines with differing values:

YOR303W ew-manual   CDS_NTE 200 250 .   +   .   Name=YOR303W_NTE_200;start_codon=TTG;

vs:

YOR303W riboviz CDS_NTE 200 1486    .   +   .   Name=YOR303W_CDS_NTE_200;frame=2;start_codon=TTG

and:

YOR335C ew-manual   CDS_NTE 203 250 .   +   .   Name=YOR335C_NTE_145;start_codon=TTG;

vs:

YOR335C riboviz CDS_NTE 203 3127    .   +   .   Name=YOR335C_CDS_NTE_203;frame=2;start_codon=TTG

@ewallace: Why might this discrepancy arise? Does it relate to the --Ntermexts TRUE in your GFF file? There is no corresponding flag within uorf_finder.py.

## find_uORFs --start_codons ATG,GTG,TTG --min_length 10 --Ntermexts TRUE --fasta uORFtest_yeast_4CDS_w_250utrs.fa --gff uORFtest_yeast_4CDS_w_250utrs.gff3

Compare to Siyin's uORFoutput_ATGGTGTTG_minLength10_yeast_4CDS_w_250utrs.gff3:

$ cat data/orf_finder_test/uORFoutput_ATGGTGTTG_minLength10_yeast_4CDS_w_250utrs.gff3 
YAL005C riboviz uORF    100 132 .   +   .   Name=YAL005C_uORF_100;frame=0;start_codon=TTG
YAL005C riboviz uORF    122 181 .   +   .   Name=YAL005C_uORF_122;frame=1;start_codon=TTG
YAL038W riboviz uORF    64  183 .   +   .   Name=YAL038W_uORF_64;frame=0;start_codon=TTG
YAL038W riboviz overlap_uORF    202 258 .   +   .   Name=YAL038W_overlap_uORF_202;frame=0;start_codon=TTG
YAL038W riboviz uORF    185 238 .   +   .   Name=YAL038W_uORF_185;frame=1;start_codon=TTG
YOR303W riboviz uORF    35  124 .   +   .   Name=YOR303W_uORF_35;frame=1;start_codon=TTG
YOR303W riboviz CDS_NTE 200 1486    .   +   .   Name=YOR303W_CDS_NTE_200;frame=1;start_codon=TTG
YOR303W riboviz uORF    117 194 .   +   .   Name=YOR303W_uORF_117;frame=2;start_codon=ATG
YOR335C riboviz uORF    145 243 .   +   .   Name=YOR335C_uORF_145;frame=0;start_codon=GTG
YOR335C riboviz uORF    41  169 .   +   .   Name=YOR335C_uORF_41;frame=1;start_codon=ATG
YOR335C riboviz CDS_NTE 203 3127    .   +   .   Name=YOR335C_CDS_NTE_203;frame=1;start_codon=TTG

Again frame values differ (1-indexed vs. 0-indexed).

ewallace commented 3 years ago

This is interesting but low-priority, because we don't need an ORF finder for ribosome profiling analysis as envisaged for a paper. Moving to "to do", reconsider when we have a paper submitted or another pressing need to find ORFs.