Open lonly7star opened 3 years ago
With more debugging, the issue comes from the data retrieved from the server. The program generates a URL as a request to the website, and the result is shown directly when the URL is opened.
Generally, the URL looks like the following:
https://api.brain-map.org/api/v2/data/Gene/query.json?criteria=%5Bacronym%24in%27Key1%27%2C%27Key2%27%2C%27Key3%27%5D%2Cproducts%5Bid%24eq1%5D&only=name%2Cacronym&
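For reference, here is a minimal sketch of how a URL of this shape can be built with the standard library. The helper name build_gene_query_url is hypothetical and is not taken from abagen; the criteria layout is inferred only from the decoded URLs in this issue.

```python
from urllib.parse import urlencode

# Endpoint taken from the URLs quoted in this issue.
BASE_URL = "https://api.brain-map.org/api/v2/data/Gene/query.json"


def build_gene_query_url(acronyms):
    """Hypothetical helper: build a Gene query URL for a list of acronyms.

    The criteria string mirrors the decoded form of the URLs above:
    [acronym$in'Key1','Key2'],products[id$eq1]
    """
    quoted = ",".join(f"'{acronym}'" for acronym in acronyms)
    criteria = f"[acronym$in{quoted}],products[id$eq1]"
    # urlencode percent-encodes brackets, quotes, commas, '$' and '*'.
    query = urlencode({"criteria": criteria, "only": "name,acronym"})
    return f"{BASE_URL}?{query}"


url = build_gene_query_url(["9130019P16Rik"])
print(url)
```

Apart from the trailing "&", the output should match the single-key URL used later in this thread, which makes it easy to generate a query for any suspicious acronym.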
The test creates a URL with 10 keys and expects 10 records in return. However, the test fails when the URL is generated from the following list:
list_36 = [6437, 19829, 4635, 16444, 4341, 8346, 1622, 15336, 19656, 17943]
samples = GENES.loc[list_36]
This is the URL generated by the system; you can copy and paste it into a browser to check the returned data.
https://api.brain-map.org/api/v2/data/Gene/query.json?criteria=%5Bacronym%24in%27Fam184b%27%2C%27F830225E14Rik%2A%27%2C%279130019P16Rik%27%2C%27Vrk1%27%2C%27Olfr1026%27%2C%27Thbs1%27%2C%27Klhdc9%27%2C%27Gjd2%27%2C%27A230044A09Rik%2A%27%2C%27Palb2%27%5D%2Cproducts%5Bid%24eq1%5D&only=name%2Cacronym&
In this case, it returns 11 elements, one of which is a repeated element:
{"acronym":"9130019P16Rik","name":"RIKEN cDNA 9130019P16 gene"},{"acronym":"9130019P16Rik","name":"RIKEN cDNA 9130019P16 gene"}]}
If we remove the key 9130019P16Rik, we get a normal response with 9 elements. Try this URL:
https://api.brain-map.org/api/v2/data/Gene/query.json?criteria=%5Bacronym%24in%27Fam184b%27%2C%27F830225E14Rik%2A%27%2C%27Vrk1%27%2C%27Olfr1026%27%2C%27Thbs1%27%2C%27Klhdc9%27%2C%27Gjd2%27%2C%27A230044A09Rik%2A%27%2C%27Palb2%27%5D%2Cproducts%5Bid%24eq1%5D&only=name%2Cacronym&
If we try the URL with only the key 9130019P16Rik, we get 2 elements back:
https://api.brain-map.org/api/v2/data/Gene/query.json?criteria=%5Bacronym%24in%279130019P16Rik%27%5D%2Cproducts%5Bid%24eq1%5D&only=name%2Cacronym&
So I suspect that the key 9130019P16Rik may not be properly stored, or that there is a hash collision that returns duplicated data. The test itself is probably not responsible for the failure.
The test in tests/mouse/test_gene.py can fail when retrieving the GENE data with a list that contains 17943.
The original test fails on the 36th run with the current random seed.
The failure happens on the second assertion:
names = gene.get_gene_info(acronym=samples['acronym'], attributes='name')
assert sorted(names['name']) == sorted(samples['name'])
When names is retrieved with "acronym": "9130019P16Rik", the query to https://api.brain-map.org/api/v2/data/Gene/query.json gets 2 records back instead of one.
Here is the URL generated by your program with only "acronym": "9130019P16Rik":
https://api.brain-map.org/api/v2/data/Gene/query.json?criteria=%5Bacronym%24in%279130019P16Rik%27%5D%2Cproducts%5Bid%24eq1%5D&only=name%2Cacronym&
Here is the response from the website for that URL:
{"success": true, "id": 0, "start_row": 0, "num_rows": 2, "total_rows": 2, "msg": [{"acronym":"9130019P16Rik","name":"RIKEN cDNA 9130019P16 gene"},{"acronym":"9130019P16Rik","name":"RIKEN cDNA 9130019P16 gene"}]}
The full record from the website is the following:
It seems the server has two different records with the key "9130019P16Rik" in the acronym column, which causes this problem; it is not your code that causes this bug.
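To check a response for repeated keys without reading it by eye, something like the following sketch can help. It parses the response body quoted above (pasted in as a string rather than fetched live) and counts how often each acronym occurs; find_duplicate_acronyms is a hypothetical name, not an abagen function.

```python
import json
from collections import Counter

# Response body copied from this issue (single-key query for 9130019P16Rik).
payload = (
    '{"success": true, "id": 0, "start_row": 0, "num_rows": 2, '
    '"total_rows": 2, "msg": ['
    '{"acronym":"9130019P16Rik","name":"RIKEN cDNA 9130019P16 gene"},'
    '{"acronym":"9130019P16Rik","name":"RIKEN cDNA 9130019P16 gene"}]}'
)


def find_duplicate_acronyms(response_text):
    """Hypothetical helper: map each repeated acronym to its occurrence count."""
    records = json.loads(response_text)["msg"]
    counts = Counter(record["acronym"] for record in records)
    return {acronym: n for acronym, n in counts.items() if n > 1}


duplicates = find_duplicate_acronyms(payload)
print(duplicates)
```

For the response above, this reports 9130019P16Rik as occurring twice, which is consistent with the 11-row result seen in the failing test.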
I propose the following:
Issue description
There is order-dependent flakiness in this test code. The test fails on exactly the 36th run of a continuous run.
Steps to reproduce the issue
pytest -k test_gene.py --flake-finder
What's the expected result?
What's the actual result?
> assert sorted(names['name']) == sorted(samples['name'])
AssertionError: assert ['RIKEN cDNA ...delta 2', ...] == ['RIKEN cDNA ...ining 9', ...]
E At index 1 diff: 'RIKEN cDNA 9130019P16 gene' != 'RIKEN cDNA A230044A09 gene (non-RefSeq)'
E Left contains one more item: 'vaccinia related kinase 1'
E Use -v to get the full diff
Additional details / screenshot
The reason for this failure is that names has 11 rows instead of 10. Comparing the samples DataFrame with the retrieved names DataFrame, notice that the "RIKEN cDNA" row is duplicated, which causes the assertion error. The duplicated key comes from the code in abagen\abagen\utis.py, line 64:
response = urlopen(url)
where it sends a query request to the server at https://api.brain-map.org/api/v2/data/..
and the server returns 11 rows of data, including the extra row, instead of 10. I suspect the flakiness is caused by the server side returning the duplicated key/value pair.
On the client side, a suggested fix is to add a duplicate check for the returned JSON.
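A minimal sketch of such a check, assuming the records arrive as a list of flat dicts with string values (as in the "msg" field of the response). The function name drop_duplicate_records is hypothetical; where exactly this would hook into abagen is left open.

```python
def drop_duplicate_records(records):
    """Hypothetical helper: drop exact duplicate records, keeping first-seen order.

    Assumes each record is a flat dict with hashable values, e.g.
    {"acronym": "...", "name": "..."}.
    """
    seen = set()
    unique = []
    for record in records:
        # Sort items so that key order within a record does not matter.
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


msg = [
    {"acronym": "9130019P16Rik", "name": "RIKEN cDNA 9130019P16 gene"},
    {"acronym": "9130019P16Rik", "name": "RIKEN cDNA 9130019P16 gene"},
    {"acronym": "Vrk1", "name": "vaccinia related kinase 1"},
]
cleaned = drop_duplicate_records(msg)
print(len(msg), len(cleaned))
```

Only records that are identical in every field are dropped, so two genes that share an acronym but differ in name would both be kept. If the response is already loaded into a pandas DataFrame, calling drop_duplicates() on it would be a comparable approach.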