lonly7star / abagen

A toolbox for working with Allen Human Brain Atlas microarray expression data
https://abagen.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Flaky Test on test_gene.py #1

Open lonly7star opened 3 years ago

lonly7star commented 3 years ago

Issue description

This test has order-dependent flakiness. In a continuous run, it fails on exactly the 36th iteration.

Steps to reproduce the issue

  1. Run pytest with flake-finder: pytest -k test_gene.py --flake-finder
  2. Or use the following list directly to reproduce the failure:
    list_36 = [6437, 19829, 17943, 4635, 16444, 4341, 8346, 1622, 19656, 15336]
    samples = GENES.loc[list_36]

What's the expected result?

What's the actual result?

Additional details / screenshot

The failure occurs because names has 11 rows instead of 10. Comparing the "samples" DataFrame (samples) with the retrieved "name" DataFrame (names) shows that the "RIKEN cDNA" row is duplicated, which causes the assertion error.

The duplicated key originates in abagen\abagen\utils.py, line 64, response = urlopen(url), which sends a query to the server at https://api.brain-map.org/api/v2/data/.. The server returns 11 records, including the extra one, instead of 10.

I suspect the flakiness is caused by the server side, which returns the duplicated key/value pair.

On the client side, a suggested fix is to check the returned JSON for duplicates.
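A minimal sketch of such a check, assuming the response records are flat dictionaries with hashable values (as in the "msg" list shown in this issue). The helper name drop_duplicate_records is hypothetical and is not part of abagen:

```python
def drop_duplicate_records(records):
    """Return records with exact duplicates removed, keeping first-seen order.

    Hypothetical helper, not part of abagen. Assumes each record is a flat
    dict whose values are hashable (e.g. strings).
    """
    seen = set()
    unique = []
    for record in records:
        # Dicts are unhashable, so use a sorted tuple of items as the key.
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# The duplicated record reported in this issue.
msg = [
    {"acronym": "9130019P16Rik", "name": "RIKEN cDNA 9130019P16 gene"},
    {"acronym": "9130019P16Rik", "name": "RIKEN cDNA 9130019P16 gene"},
]
deduped = drop_duplicate_records(msg)
print(len(msg), len(deduped))
```

Only records that match on every field are dropped, so two genuinely different records that share an acronym would both be kept.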

lonly7star commented 3 years ago

After more debugging, the issue turns out to come from the data retrieved from the server. The program generates a URL as a request to the website, and the result is shown directly when the URL is opened.

In general, the URL looks like this: https://api.brain-map.org/api/v2/data/Gene/query.json?criteria=%5Bacronym%24in%27Key1%27%2C%27Key2%27%2C%27Key3%27%5D%2Cproducts%5Bid%24eq1%5D&only=name%2Cacronym&
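For reference, the URL pattern above can be rebuilt with the standard library. This is a sketch reconstructed from the shape of the URLs in this issue, not abagen's actual implementation; the function name build_gene_query_url is hypothetical:

```python
from urllib.parse import quote

BASE_URL = "https://api.brain-map.org/api/v2/data/Gene/query.json"


def build_gene_query_url(acronyms):
    """Build a query URL in the same shape as the ones quoted in this issue.

    Hypothetical helper: the criteria string is inferred by decoding those
    URLs, not taken from abagen's source.
    """
    quoted_keys = ",".join(f"'{acronym}'" for acronym in acronyms)
    criteria = f"[acronym$in{quoted_keys}],products[id$eq1]"
    # safe="" percent-encodes every reserved character, including , $ [ ] ' *
    return (
        f"{BASE_URL}?criteria={quote(criteria, safe='')}"
        f"&only={quote('name,acronym', safe='')}&"
    )


single_key_url = build_gene_query_url(["9130019P16Rik"])
print(single_key_url)
```

Calling it with only 9130019P16Rik yields the single-key URL quoted later in this thread, which makes it easy to probe one acronym at a time.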

The test creates a URL with 10 keys and expects 10 records in return. However, a test generated with the following list fails:
    list_36 = [6437, 19829, 4635, 16444, 4341, 8346, 1622, 15336, 19656, 17943]
    samples = GENES.loc[list_36]

This is the URL generated by the system; you can paste it into a browser to check the response: https://api.brain-map.org/api/v2/data/Gene/query.json?criteria=%5Bacronym%24in%27Fam184b%27%2C%27F830225E14Rik%2A%27%2C%279130019P16Rik%27%2C%27Vrk1%27%2C%27Olfr1026%27%2C%27Thbs1%27%2C%27Klhdc9%27%2C%27Gjd2%27%2C%27A230044A09Rik%2A%27%2C%27Palb2%27%5D%2Cproducts%5Bid%24eq1%5D&only=name%2Cacronym&

In this case, it returns 11 elements, one of which is repeated: {"acronym":"9130019P16Rik","name":"RIKEN cDNA 9130019P16 gene"},{"acronym":"9130019P16Rik","name":"RIKEN cDNA 9130019P16 gene"}]}

If we remove the key 9130019P16Rik, the response is normal, with 9 elements. Try this URL: https://api.brain-map.org/api/v2/data/Gene/query.json?criteria=%5Bacronym%24in%27Fam184b%27%2C%27F830225E14Rik%2A%27%2C%27Vrk1%27%2C%27Olfr1026%27%2C%27Thbs1%27%2C%27Klhdc9%27%2C%27Gjd2%27%2C%27A230044A09Rik%2A%27%2C%27Palb2%27%5D%2Cproducts%5Bid%24eq1%5D&only=name%2Cacronym&

If we query with only the key 9130019P16Rik, 2 elements are returned: https://api.brain-map.org/api/v2/data/Gene/query.json?criteria=%5Bacronym%24in%279130019P16Rik%27%5D%2Cproducts%5Bid%24eq1%5D&only=name%2Cacronym&

So I suspect that the key 9130019P16Rik may not be stored properly, or that a hash collision causes duplicated data to be returned. The test itself is probably not responsible for the failure.

lonly7star commented 3 years ago

The test in tests/mouse/test_gene.py can fail when it retrieves the gene data with a list that contains 17943. The original test fails on the 36th run with the current random seed.

The failure happens at the second assertion:
    names = gene.get_gene_info(acronym=samples['acronym'], attributes='name')
    assert sorted(names['name']) == sorted(samples['name'])
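To illustrate why a single duplicated record is enough to break this comparison, here is a small sketch that uses plain lists in place of the DataFrame columns. Only the RIKEN name comes from this issue; the other name is a placeholder:

```python
# Stand-ins for samples['name'] and names['name']; "placeholder gene B" is
# made up for illustration.
expected_names = ["RIKEN cDNA 9130019P16 gene", "placeholder gene B"]

# The server returns the RIKEN record twice, so the retrieved list is longer.
returned_names = expected_names + ["RIKEN cDNA 9130019P16 gene"]

# Sorting does not help: the lists differ in length, so they can never be equal.
lists_match = sorted(returned_names) == sorted(expected_names)
print(lists_match)
```

The comparison is order-insensitive but not duplicate-insensitive, so it only fails when the random sample happens to include the affected gene.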

When names is retrieved with "acronym": "9130019P16Rik", the query to https://api.brain-map.org/api/v2/data/Gene/query.json returns 2 records instead of 1.

Here is the URL generated by your program with only "acronym": "9130019P16Rik": https://api.brain-map.org/api/v2/data/Gene/query.json?criteria=%5Bacronym%24in%279130019P16Rik%27%5D%2Cproducts%5Bid%24eq1%5D&only=name%2Cacronym&

Here is the website's response for that URL: {"success": true, "id": 0, "start_row": 0, "num_rows": 2, "total_rows": 2, "msg": [{"acronym":"9130019P16Rik","name":"RIKEN cDNA 9130019P16 gene"},{"acronym":"9130019P16Rik","name":"RIKEN cDNA 9130019P16 gene"}]}
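A short sketch of how the duplication can be detected programmatically, using the response text above verbatim. The variable names are illustrative only:

```python
import json
from collections import Counter

# Response body copied from this issue.
response_text = (
    '{"success": true, "id": 0, "start_row": 0, "num_rows": 2, '
    '"total_rows": 2, "msg": ['
    '{"acronym":"9130019P16Rik","name":"RIKEN cDNA 9130019P16 gene"},'
    '{"acronym":"9130019P16Rik","name":"RIKEN cDNA 9130019P16 gene"}]}'
)

payload = json.loads(response_text)

# Count how often each acronym appears among the returned records.
acronym_counts = Counter(record["acronym"] for record in payload["msg"])
duplicated = {acr: n for acr, n in acronym_counts.items() if n > 1}
print(duplicated)  # {'9130019P16Rik': 2}
```

A non-empty result means the server returned more rows than distinct acronyms, which is the condition that trips the test.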

The full record from the website is shown in the two attached images: info2, info1.

It seems that the server holds two separate records with the key "9130019P16Rik" in the acronym column, which causes this problem. The bug is not caused by your code.

I propose the following:

  1. Do you want to report this issue to the server maintainers?
  2. Do you want to change your code to detect/remove the duplicated result?