Open ewallace opened 5 years ago
If there was a code, @ewallace would use it straight away.
@shahpr does have some code, but not of production quality/of sharable quality.
No obvious assignee; backlog.
@ewallace opened branch orf-finder-92
, containing test files for input and output of orf-finder
>data
> orf_finder_test
> uORFtest_yeast_4CDS_w_250utrs.fa
> uORFtest_yeast_4CDS_w_250utrs.gff3
> uORFtestresults_ATGGTGTTG_minLength10_yeast_4CDS_w_250utrs.gff3
> uORFtestresults_ATGonly_minLength20_yeast_4CDS_w_250utrs.gff3
This tests functionality for a future script, find_uORFs
, that would find upstream ORFs based on an existing main ORF / CDS annotation.
This relates to to @XUEXUEXUE0's project and we should come back to it once we can run riboviz on the cluster.
I think I'm almost finished this pipeline but I am not sure that which command line parameters we need. I added a few simple parameter.
Users can use
python uORF_finder.py --fasta uORFtest_yeast_4CDS_w_250utrs.fa --gff3 uORFtest_yeast_4CDS_w_250utrs.gff3 --output test2.gff3 --min_length 10
The ouput file and code are on https://github.com/riboviz/riboviz/tree/orf-finder-92/data/orf_finder_test
Do I need to add more parameters in command line? What else needs to be improved in the output file?
One note for when this script is complete, it's maybe worth moving it into the folder with the rest of the python scripts and tools? Currently it's in the data folder, which I wouldn't expect, and moving it to somewhere like riboviz/riboviz/tools/
would make more sense I think.
This is great progress! Following today's dev meeting, and a quick review of the code, some suggestions as to how to make further progress:
start_codons
, with default value ATG
.stop_codons
, with default value TAA,TGA,TAG
(as a string? I don't know)min_length
is the minimum ORF length (in codons/amino acids) returned. So a candidate ATGAAATAA
would have length 2 and not be included. min_length
, protein_length
, and treshold
are all used and I'm confused, could we simplify that?list1
and list2
more informative namesmin_length=20
, and only ATG
as the start codon.riboviz
source replaces ew_manual
, and the _uORF
is not in the feature name.riboviz.check_fasta_gff.py
script. Related to issue #25.Would it be helpful to have a smaller function that just finds the next position of a codon from a list?
For example:
find_next_string_pos(string_vector, start_pos, target_list)
With example:
find_next_string_pos( ["ATG", "AAA", "TGA"], 1, ["AAA"]) = 2
Then this could be used to find the next in-frame start codon, with input target_list = ["ATG"]
, or the next in-frame stop codon, with input target_list = ["TGA", "TAG", "TAA"]
. To find an ORF from a vector of codons, the stop codon would have to be searched for starting at the position one after that returned by the previous start codon.
(this might already be some kind of regex functionality)
@XUEXUEXUE0 ...
You'll find it far easier to write unit tests for your other parts of your script if you refactor your script to ensure that all its functionality is wrapped within functions. These functions can then each have isolated tests for them. An initial refactoring would be into something like:
parse_command_line_arguments
: Set up argparse.ArgumentParser()
and return parser.parse_args()
.find_orfs
: Find the ORFs. This function includes the code to create the GFF database, create the position_dict
and to iterate through the keys and FASTA records i.e. it does everything except parse the command-line arguments. This function would take the arguments you extract from the command-line i.e. myfasta
, mygff2
, output_file
, min_length
, start
, stop
.invoke_find_orfs
: the function for the command-line script. This function calls parse_command_line_arguments
, extracts the values from the arguments, and finds the ORFs by calling find_orfs
.The script should also include a condition to check if the script is being run as a command-line script or is being imported. If it is being run as a command-line script then it calls invoke_find_orfs
:
if __name__ == "__main__":
invoke_find_orfs()
This style of design means that any test script can import uORF_finder.py
and test its functions without the script itself being run as if it was a command-line function. For an example of this style of design see:
if
block plus the command-line parsing functions.You wrote that:
My initial idea is to create simple input .fasta file and .gff3 first. Then create a test output .gff3 file. Finally check if the output .gff3 file we create is equal to the .gff3 file produced by the script using assert statement. Is my idea correct and reasonable?
This could be the first test function written for your script. A test function would invoke find_orfs
, load output_file
, load your test .gff3 file then compare their contents. As .gff3 files are tab-separated values files you can load these as Pandas data frames then compare them using pandas.DataFrame.equals. For example:
import pandas as pd
assert data1.equals(data2)
An approach more in the spirit of unit testing is to use pandas.testing.assert_frame_equal. For example:
from pandas.testing import assert_frame_equal
assert_frame_equal(data1, data2)
As an example:
>>> import pandas as pd
>>> from pandas.testing import assert_frame_equal
>>> data1 = pd.read_csv("data/yeast_CDS_w_250utrs.gff3", sep="\t", comment="#")
>>> data2 = pd.read_csv("data/yeast_CDS_w_250utrs.gff3", sep="\t", comment="#")
These are the same file so the contents are expected to be equal:
>>> assert_frame_equal(data1, data2)
There is no exception so that are equal.
Now load a file we know to be different and try again:
>>> data3 = pd.read_csv("data/yeast_features.tsv", sep="\t", comment="#")
>>> assert_frame_equal(data1, data3)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Anaconda3\lib\site-packages\pandas\util\testing.py", line 1458, in assert_frame_equal
'{shape!r}'.format(shape=right.shape))
File "C:\Anaconda3\lib\site-packages\pandas\util\testing.py", line 1094, in raise_assert_detail
raise AssertionError(msg)
AssertionError: DataFrame are different
DataFrame shape mismatch
[left]: (17435, 9)
[right]: (5293, 8)
This is what we expect. If this happened within a pytest function then the test would fail as we would want it to.
This provides you with a simple regression test.
You could then refactor find_orfs
into additional functions. For example:
create_gff_db
: Create the GFF database.create_position_dict
: Create the positions_dict
dictionary.process_fasta_records
: Iterate through the keys and FASTA records and checks if key == record.id
and, if so, call a new process_fasta_record
function (see below).process_fasta_record
: Process a matching FASTA record. This function includes the contents of the if key == record.id
block and call a new create_fasta_record
function (see below) then append the record to the output file.create_fasta_record
: Create a new FASTA record. This function includes everything from seq = record.seq
to rec.features = [feature]
and returns rec
. This design results in a function that can be tested without needing to use input or output files (like getORF
).find_orfs
can then be refactored to call create_gff_db
, create_position_dict
, process_fasta_records
. As its arguments would not change, the regression test you have already written could be used to test that your refactoring did not introduce any bugs.
create_position_dict
can be tested using one or more input GFF files with records that would both result in entries being added to positions_dict
(e.g. a sequence with both a UTR5 and CDS) and records that would not (e.g. a sequence with an UTR5 only, a sequence with a CDS only). For each test, you can check that and positions_dict
contains entries for the sequences with the expected features and that the positions have the expected values.
Test functions for getORF
can give getORF
values for its arguments and then check that the dictionary of lists it returns has the expected values. An initial set of tests could be:
getORF
a sequence seq
, and end_UTR5
and end_CDS
values such that an ORF of type "uORF" is found and returned.getORF
a sequence seq
, and end_UTR5
and end_CDS
values such that an ORF of type "CDS_NTE" is found and returned.getORF
a sequence seq
, and end_UTR5
and end_CDS
values such that an ORF of type "overlap_UORF" is found and returned.getORF
a sequence seq
, and end_UTR5
and end_CDS
values such that no ORF is found.As Edward suggested, the sequences can be short and you can make them up such that they represent the cases you want to test.
Each test function checks that the values in the list (frame index, length, start position, end position, type, name) are as expected. For the last option above no values would be expected.
The contents of the code.docx file you emailed could be added as comments to the code to document what functions do, and their argument and return types. See https://github.com/riboviz/riboviz/blob/master/riboviz/demultiplex_fastq.py for examples of a format to use.
I agree. The crucial functionality is in create_position_dict
for finding the actual ORFs. The rest is input/output handling. The functionality to find ORFs in a sequence (isn't that create_position_dict
?) is independent of the functionality to filter and annotate them based on position relative to the CDS. So I'd suggest, first find ORFs, second classify as upstream/overlapping/NTE/whatever.
@mikej888 @ewallace Thank you!
Commit bc2c2a5 refactor the original code. The new script fixes the bug that could generate inframe stop codons and add the function which find the longest orf when the orf have the same stop codon but different start codon. The output of the new script is equal to the test which made by Edward now.
To do:
add comments to the script
write tests for the script
make a flowchart for the script
Here is my flowcharts for the whole script and the core function of the script. Hope these make my code easier to understand
Great start! For the overall flow for the "whole script", I would include "find out longest ORFs" in the "identify all ORFs" step.
Running the current code...
$ python riboviz/tools/uorf_finder.py --fasta data/orf_finder_test/uORFtest_yeast_4CDS_w_250utrs.fa \
--gff3 data/orf_finder_test/uORFtest_yeast_4CDS_w_250utrs.gff3 \
--output output_atg20.gff3 --new_fasta output_atg20.fasta \
--min_length 20
ModuleNotFoundError: No module named 'BCBio'
$ pip install bcbio-gff
Successfully installed bcbio-gff-0.6.6
Use Edward's inputs:
uORFtest_yeast_4CDS_w_250utrs.fa
uORFtest_yeast_4CDS_w_250utrs.gff3
Try to replicate Edward's output uORFtestresults_ATGonly_minLength20_yeast_4CDS_w_250utrs.gff3
:
$ python riboviz/tools/uorf_finder.py --fasta data/orf_finder_test/uORFtest_yeast_4CDS_w_250utrs.fa \
--gff3 data/orf_finder_test/uORFtest_yeast_4CDS_w_250utrs.gff3 \
--output output_atg20.gff3 --new_fasta output_atg20.fasta \
--min_length 20
$ cat output_atg20.gff3
YAL038W riboviz uORF 109 183 . + . Name=YAL038W_uORF_109;frame=1;start_codon=ATG
YOR303W riboviz uORF 117 194 . + . Name=YOR303W_uORF_117;frame=3;start_codon=ATG
YOR335C riboviz uORF 41 169 . + . Name=YOR335C_uORF_41;frame=2;start_codon=ATG
$ cat data/orf_finder_test/uORFtestresults_ATGonly_minLength20_yeast_4CDS_w_250utrs.gff3
##gff-version 3
##source-ew-manual Edward Wallace made this by hand
##to test upstream ORF finders with ATG starts and 20-codon minimum
## find_uORFs --start_codons ATG --min_length 20 --fasta uORFtest_yeast_4CDS_w_250utrs.fa --gff uORFtest_yeast_4CDS_w_250utrs.gff3
##date 2019-31-10
YAL038W ew-manual uORF 109 183 . + . Name=YAL038W_uORF_109;start_codon=ATG;
YOR303W ew-manual uORF 117 194 . + . Name=YOR303W_uORF_117;start_codon=ATG;
YOR335C ew-manual uORF 41 169 . + . Name=YOR335C_uORF_41;start_codon=ATG;
Compare to Siyin's uORFoutput_ATGonly_minLength20_yeast_4CDS_w_250utrs.gff3
:
$ cat data/orf_finder_test/uORFoutput_ATGonly_minLength20_yeast_4CDS_w_250utrs.gff3
YAL038W riboviz uORF 109 183 . + . Name=YAL038W_uORF_109;frame=0;start_codon=ATG
YOR303W riboviz uORF 117 194 . + . Name=YOR303W_uORF_117;frame=2;start_codon=ATG
YOR335C riboviz uORF 41 169 . + . Name=YOR335C_uORF_41;frame=1;start_codon=ATG
frame
values differ (1-indexed vs. 0-indexed). @ewallace Which should it be?
Check output FASTA file:
$ cat output_atg20.fasta
>YAL005C <unknown description>
TCCATCGGCGGCAAAAGGGAGAGAAAGAACCCAAAAAGAAGGGGGGCCATTTAGATTAGC
...
CGTTGAAGAAGTTGATTAA
>YAL038W <unknown description>
AAAAAAAAGGCTCGCCATCAAAACGATATTCGTTGGCTTTTTTTTCTGAATTATAAATAC
...
CTCTACCGTTTAA
>YOR303W <unknown description>
GAACTATAACACATAGATACACTCGAACATCTAATTGTTTAAATACTGCAAAGAATACAA
...
GAGTATAAATGTTACTAAGTTGGCCAAGGAAAGAGTGTTGTTCTAA
>YOR335C <unknown description>
GGCCGTTTCTTTTTTCTCTTTTTTTTCATCCGCTCGATGGATGATGAGTAAAACAAGAAA
...
CATTTAA
@ewallace Remove <unknown description>
or replace with...?
Tthe output FASTA file has records corresponding to all the sequences from the input FASTA file:
$ grep \> data/orf_finder_test/uORFtest_yeast_4CDS_w_250utrs.fa
>YAL005C
>YAL038W
>YOR303W
>YOR335C
@ewallace Should YAL005C be there? It is not in the output GFF file.
Running with --min_length 10
(instead of 20
) gives produces an identical FASTA file as above:
$ python riboviz/tools/uorf_finder.py --fasta data/orf_finder_test/uORFtest_yeast_4CDS_w_250utrs.fa \
--gff3 data/orf_finder_test/uORFtest_yeast_4CDS_w_250utrs.gff3 \
--output output_atg10.gff3 --new_fasta output_atg10.fasta --min_length 10
$ cat output_atg10.gff3
YAL005C riboviz uORF 103 132 . + . Name=YAL005C_uORF_103;frame=1;start_codon=ATG
YAL038W riboviz uORF 109 183 . + . Name=YAL038W_uORF_109;frame=1;start_codon=ATG
YOR303W riboviz uORF 117 194 . + . Name=YOR303W_uORF_117;frame=3;start_codon=ATG
YOR335C riboviz uORF 41 169 . + . Name=YOR335C_uORF_41;frame=2;start_codon=ATG
$ cat output_atg10.fasta
>YAL005C <unknown description>
TCCATCGGCGGCAAAAGGGAGAGAAAGAACCCAAAAAGAAGGGGGGCCATTTAGATTAGC
...
CGTTGAAGAAGTTGATTAA
>YAL038W <unknown description>
AAAAAAAAGGCTCGCCATCAAAACGATATTCGTTGGCTTTTTTTTCTGAATTATAAATAC
...
CTCTACCGTTTAA
>YOR303W <unknown description>
GAACTATAACACATAGATACACTCGAACATCTAATTGTTTAAATACTGCAAAGAATACAA
...
GAGTATAAATGTTACTAAGTTGGCCAAGGAAAGAGTGTTGTTCTAA
>YOR335C <unknown description>
GGCCGTTTCTTTTTTCTCTTTTTTTTCATCCGCTCGATGGATGATGAGTAAAACAAGAAA
...
CATTTAA
$ cmp output_atg10.fasta output_atg20.fasta
$ echo $?
0
Try to replicate Edward's output uORFtestresults_ATGGTGTTG_minLength10_yeast_4CDS_w_250utrs.gff3
:
$ python riboviz/tools/uorf_finder.py --fasta data/orf_finder_test/uORFtest_yeast_4CDS_w_250utrs.fa \
--gff3 data/orf_finder_test/uORFtest_yeast_4CDS_w_250utrs.gff3 \
--output output_atggtgttg_10.gff3 --new_fasta output_atggtgttg_10.fasta \
--min_length 10 --start_codons ATG GTG TTG
$ cat output_atggtgttg_10.gff3
YAL005C riboviz uORF 100 132 . + . Name=YAL005C_uORF_100;frame=1;start_codon=TTG
YAL005C riboviz uORF 122 181 . + . Name=YAL005C_uORF_122;frame=2;start_codon=TTG
YAL038W riboviz uORF 64 183 . + . Name=YAL038W_uORF_64;frame=1;start_codon=TTG
YAL038W riboviz overlap_uORF 202 258 . + . Name=YAL038W_overlap_uORF_202;frame=1;start_codon=TTG
YAL038W riboviz uORF 185 238 . + . Name=YAL038W_uORF_185;frame=2;start_codon=TTG
YOR303W riboviz uORF 35 124 . + . Name=YOR303W_uORF_35;frame=2;start_codon=TTG
YOR303W riboviz CDS_NTE 200 1486 . + . Name=YOR303W_CDS_NTE_200;frame=2;start_codon=TTG
YOR303W riboviz uORF 117 194 . + . Name=YOR303W_uORF_117;frame=3;start_codon=ATG
YOR335C riboviz uORF 145 243 . + . Name=YOR335C_uORF_145;frame=1;start_codon=GTG
YOR335C riboviz uORF 41 169 . + . Name=YOR335C_uORF_41;frame=2;start_codon=ATG
YOR335C riboviz CDS_NTE 203 3127 . + . Name=YOR335C_CDS_NTE_203;frame=2;start_codon=TTG
$ cat data/orf_finder_test/uORFtestresults_ATGGTGTTG_minLength10_yeast_4CDS_w_250utrs.gff3
##gff-version 3
##source-ew-manual Edward Wallace made this by hand
##to test upstream ORF finders with ATG,GTG,TTG starts and 10-codon minimum
## find_uORFs --start_codons ATG,GTG,TTG --min_length 10 --Ntermexts TRUE --fasta uORFtest_yeast_4CDS_w_250utrs.fa --gff uORFtest_yeast_4CDS_w_250utrs.gff3
##date 2019-31-10
YAL005C ew-manual uORF 100 132 . + . Name=YAL005C_uORF_100;start_codon=TTG;
YAL005C ew-manual uORF 122 181 . + . Name=YAL005C_uORF_122;start_codon=TTG;
YAL038W ew-manual uORF 64 183 . + . Name=YAL038W_uORF_64;start_codon=TTG;
YAL038W ew-manual uORF 185 238 . + . Name=YAL038W_uORF_185;start_codon=TTG;
YAL038W ew-manual overlap_uORF 202 258 . + . Name=YAL038W_uORF_202;start_codon=TTG;
YOR303W ew-manual uORF 35 124 . + . Name=YOR303W_uORF_35;start_codon=TTG;
YOR303W ew-manual uORF 117 194 . + . Name=YOR303W_uORF_117;start_codon=ATG;
YOR303W ew-manual CDS_NTE 200 250 . + . Name=YOR303W_NTE_200;start_codon=TTG;
YOR335C ew-manual uORF 41 169 . + . Name=YOR335C_uORF_41;start_codon=ATG;
YOR335C ew-manual uORF 145 243 . + . Name=YOR335C_uORF_145;start_codon=GTG;
YOR335C ew-manual CDS_NTE 203 250 . + . Name=YOR335C_NTE_145;start_codon=TTG;
Order is different and there are two lines with differing values:
YOR303W ew-manual CDS_NTE 200 250 . + . Name=YOR303W_NTE_200;start_codon=TTG;
vs:
YOR303W riboviz CDS_NTE 200 1486 . + . Name=YOR303W_CDS_NTE_200;frame=2;start_codon=TTG
and:
YOR335C ew-manual CDS_NTE 203 250 . + . Name=YOR335C_NTE_145;start_codon=TTG;
vs:
YOR335C riboviz CDS_NTE 203 3127 . + . Name=YOR335C_CDS_NTE_203;frame=2;start_codon=TTG
@ewallace: Why might this discrepancy arise? Does it relate to the --Ntermexts TRUE
in your GFF file? There is no corresponding flag within uorf_finder.py
.
## find_uORFs --start_codons ATG,GTG,TTG --min_length 10 --Ntermexts TRUE --fasta uORFtest_yeast_4CDS_w_250utrs.fa --gff uORFtest_yeast_4CDS_w_250utrs.gff3
Compare to Siyin's uORFoutput_ATGGTGTTG_minLength10_yeast_4CDS_w_250utrs.gff3
:
$ cat data/orf_finder_test/uORFoutput_ATGGTGTTG_minLength10_yeast_4CDS_w_250utrs.gff3
YAL005C riboviz uORF 100 132 . + . Name=YAL005C_uORF_100;frame=0;start_codon=TTG
YAL005C riboviz uORF 122 181 . + . Name=YAL005C_uORF_122;frame=1;start_codon=TTG
YAL038W riboviz uORF 64 183 . + . Name=YAL038W_uORF_64;frame=0;start_codon=TTG
YAL038W riboviz overlap_uORF 202 258 . + . Name=YAL038W_overlap_uORF_202;frame=0;start_codon=TTG
YAL038W riboviz uORF 185 238 . + . Name=YAL038W_uORF_185;frame=1;start_codon=TTG
YOR303W riboviz uORF 35 124 . + . Name=YOR303W_uORF_35;frame=1;start_codon=TTG
YOR303W riboviz CDS_NTE 200 1486 . + . Name=YOR303W_CDS_NTE_200;frame=1;start_codon=TTG
YOR303W riboviz uORF 117 194 . + . Name=YOR303W_uORF_117;frame=2;start_codon=ATG
YOR335C riboviz uORF 145 243 . + . Name=YOR335C_uORF_145;frame=0;start_codon=GTG
YOR335C riboviz uORF 41 169 . + . Name=YOR335C_uORF_41;frame=1;start_codon=ATG
YOR335C riboviz CDS_NTE 203 3127 . + . Name=YOR335C_CDS_NTE_203;frame=1;start_codon=TTG
Again frame
values differ (1-indexed vs. 0-indexed).
This is interesting but low-priority, because we don't need an ORF finder for ribosome profiling analysis as envisaged for a paper. Moving to "to do", reconsider when we have a paper submitted or another pressing need to find ORFs.
We need to have an ORF finder for the transcriptome that is flexible to non-AUG start codons. Ideally this would also output ORFs in a gff format testable by extending
check_gff_fasta.py
Similar or identical functionality is widely available, for example:
find_orfs_with_trans
in Biopython tutorialdetectORF.py
, I think after mapping from the genome?We should have one that is easy to use and slots into the pipeline, with standard/well-defined/tested data formats.
@shahpr do you have some code handy, or a recommendation?
Related to:
74 (gff specification to include unusual ORFs and alternative ORFs)
25 (add functionality to
check_gff_fasta.py
)89 (because we need to calculate read frame for these ORFs too)