Error in specific samples #9

Open Lucas-Maciel opened 4 years ago

Lucas-Maciel commented 4 years ago


I'm using AMON in my metagenomic data. I have 79 MAGs, and in 70 I was able to run it without problems. But for 9 of them I get the following error. I believe that it may be due to a non-recognized KO annotation, but I don't know how to figure out which ones.

amon.py -i ko_list.txt -o ../teste

Traceback (most recent call last):
  File "/home/ABTLUS/lucas.maciel/anaconda3/envs/AMON/bin/amon.py", line 74, in <module>
    main(kos_loc, output_dir, other_kos_loc, detected_compounds, name1, name2, keep_separated, samples_are_columns,
  File "/home/ABTLUS/lucas.maciel/anaconda3/envs/AMON/lib/python3.8/site-packages/AMON/predict_metabolites.py", line 283, in main
    ko_dict = get_kegg_record_dict(set(all_kos), parse_ko, ko_file_loc)
  File "/home/ABTLUS/lucas.maciel/anaconda3/envs/AMON/lib/python3.8/site-packages/KEGG_parser/downloader.py", line 55, in get_kegg_record_dict
    records = get_from_kegg_api(loop, list_of_ids, parser)
  File "/home/ABTLUS/lucas.maciel/anaconda3/envs/AMON/lib/python3.8/site-packages/KEGG_parser/downloader.py", line 49, in get_from_kegg_api
    return [parser(raw_record) for raw_record in loop.run_until_complete(kegg_download_manager(loop, list_of_ids))]
  File "/home/ABTLUS/lucas.maciel/anaconda3/envs/AMON/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/home/ABTLUS/lucas.maciel/anaconda3/envs/AMON/lib/python3.8/site-packages/KEGG_parser/downloader.py", line 43, in kegg_download_manager
    results = await asyncio.gather(*tasks)
  File "/home/ABTLUS/lucas.maciel/anaconda3/envs/AMON/lib/python3.8/site-packages/KEGG_parser/downloader.py", line 30, in download_coroutine
    return await response.text()
  File "/home/ABTLUS/lucas.maciel/anaconda3/envs/AMON/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 1014, in text
    return self._body.decode(encoding, errors=errors)  # type: ignore
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 80011: invalid start byte

Lucas-Maciel commented 4 years ago

When I tried installing on another machine, I got a different error:

Traceback (most recent call last):
  File "/home/ABTLUS/lucas.maciel/anaconda3/envs/AMON/bin/amon.py", line 74, in <module>
    main(kos_loc, output_dir, other_kos_loc, detected_compounds, name1, name2, keep_separated, samples_are_columns,
  File "/home/ABTLUS/lucas.maciel/anaconda3/envs/AMON/lib/python3.8/site-packages/AMON/predict_metabolites.py", line 283, in main
    ko_dict = get_kegg_record_dict(set(all_kos), parse_ko, ko_file_loc)
  File "/home/ABTLUS/lucas.maciel/anaconda3/envs/AMON/lib/python3.8/site-packages/KEGG_parser/downloader.py", line 55, in get_kegg_record_dict
    records = get_from_kegg_api(loop, list_of_ids, parser)
  File "/home/ABTLUS/lucas.maciel/anaconda3/envs/AMON/lib/python3.8/site-packages/KEGG_parser/downloader.py", line 49, in get_from_kegg_api
    return [parser(raw_record) for raw_record in loop.run_until_complete(kegg_download_manager(loop, list_of_ids))]
  File "/home/ABTLUS/lucas.maciel/anaconda3/envs/AMON/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/home/ABTLUS/lucas.maciel/anaconda3/envs/AMON/lib/python3.8/site-packages/KEGG_parser/downloader.py", line 43, in kegg_download_manager
    results = await asyncio.gather(*tasks)
  File "/home/ABTLUS/lucas.maciel/anaconda3/envs/AMON/lib/python3.8/site-packages/KEGG_parser/downloader.py", line 35, in download_coroutine
    raise ValueError('KEGG has forbidden request after %s attempts' % attempts)
ValueError: KEGG has forbidden request after 10 attempts

Tim-Sto commented 4 years ago

Hi Lucas-Maciel,

Did you find a solution for the second error that you got in March (KEGG has forbidden request after 10 attempts)? I am trying to run AMON and I get the same error message.

Lucas-Maciel commented 4 years ago

@Tim-Sto, to tell the truth, I still don't really understand why, but I was able to make it work. If you run the same input about 10 times, one of the runs may work. Another thing that sometimes works, and I also don't know why, is this: if you have a file with 1000 lines and it is not working, you can use head -n 1000 files.txt > ko_list.txt and run with this new file.

Tim-Sto commented 4 years ago

@Lucas-Maciel, thank you very much for your answer. I tried running the same input 10 times and I also tried the head command, but neither worked. However, I think I figured out what the problem is. I used a KO list based on the human genome (more than 10,000 KOs) as input for the host. As mentioned in the code, the KEGG API has download limits for those without a subscription, and a list of 10,000 KOs probably exceeds them. @kthurimella, did you already find a workaround for this problem?

shafferm commented 4 years ago

Hey @Lucas-Maciel and @Tim-Sto,

We have looked for ways around this but we have never been able to find the limitations of the KEGG API. In some documentation they mention that it is a rate limitation (e.g. no more than 1000 requests per minute) but they never say what the rate is. My recommendation is to run subsets like @Lucas-Maciel said. If we knew what the KEGG API limits were we could set up AMON to only poll their servers within this limit but since they don't all we can do is guess. We haven't found any better parameters than the ones set as default in AMON to get around it. You can also try using the --save_entries flag to save the output of the KEGG API in json format and then you could analyze the results manually. AMON does not currently support taking those json files as input.

Sorry about the lack of an answer, but it seems surprisingly hard to find info in this area.
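
A minimal sketch of the "run subsets" suggestion, assuming the KO list is a plain text file with one KO per line (as the head command earlier in the thread implies). The helper names, the output file names and the chunk size of 500 are all hypothetical; 500 is a guess, not a documented KEGG limit.

```python
from pathlib import Path


def split_ko_list(ko_ids, chunk_size):
    """Split a list of KO ids into consecutive chunks of at most chunk_size."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [ko_ids[i:i + chunk_size] for i in range(0, len(ko_ids), chunk_size)]


def write_subsets(ko_file, out_dir, chunk_size=500):
    """Write each chunk of a one-KO-per-line file to its own file.

    The default chunk size is an assumption, not a known KEGG limit.
    """
    lines = Path(ko_file).read_text().splitlines()
    ko_ids = [line.strip() for line in lines if line.strip()]  # skip blank lines
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for n, chunk in enumerate(split_ko_list(ko_ids, chunk_size), start=1):
        path = out_dir / f"ko_subset_{n:03d}.txt"  # hypothetical naming scheme
        path.write_text("\n".join(chunk) + "\n")
        paths.append(path)
    return paths
```

Each resulting file could then be passed to amon.py -i in turn, ideally with a pause between runs. Whether the per-subset outputs can be merged afterwards is not covered here.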


vindarbot commented 3 years ago

Same problem. Do you know where I can find the files for these options, so that I can avoid sending requests to KEGG?

--ko_file_loc KO_FILE_LOC    Location of ko file from KEGG FTP download (default: None)
--rn_file_loc RN_FILE_LOC    Location of reaction file from KEGG FTP download (default: None)
--co_file_loc CO_FILE_LOC    Location of compound file from KEGG FTP download (default: None)
--pathway_file_loc PATHWAY_FILE_LOC

Thanks in advance

sterrettJD commented 1 year ago

Hi all, I believe I've fixed this issue with the latest release of KEGG_Parser (which is now bumped to 0.0.7 to fix pip compatibility issues). If the asynchronous downloads are forbidden (due to the request rate being too high), it will download each URL from the KEGG API sequentially. This is quite a bit slower, but it does get around the issue.
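
For readers who want to see the shape of that fallback, here is a rough sketch of the general pattern, not KEGG_Parser's actual code. The Forbidden exception and all function names are hypothetical stand-ins, and the two fetchers are passed in so the example runs without touching the network.

```python
import asyncio


class Forbidden(Exception):
    """Hypothetical stand-in for a 403 response from the server."""


async def fetch_all_async(urls, fetch_async):
    # Start every request concurrently; the first Forbidden propagates to the caller.
    return await asyncio.gather(*(fetch_async(url) for url in urls))


def fetch_all_sequential(urls, fetch_sync):
    # One request at a time, in input order.
    return [fetch_sync(url) for url in urls]


def fetch_with_fallback(urls, fetch_async, fetch_sync):
    """Try the concurrent path first and fall back to sequential on Forbidden."""
    try:
        return asyncio.run(fetch_all_async(urls, fetch_async))
    except Forbidden:
        print("Asynchronous download failed; retrying sequentially.")
        return fetch_all_sequential(urls, fetch_sync)
```

Injecting the fetchers keeps the control flow testable in isolation. A real version would wrap an HTTP client and translate a 403 status into the exception.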

raeshrode commented 9 months ago

Hello @sterrettJD thank you for updating KEGG_Parser! I am using version 0.0.7 but unfortunately I am getting the same error as @Lucas-Maciel. I am thinking of downloading the KEGG FTP files. Where can I find those? @vindarbot were you able to locate them?

Thanks! :)

sterrettJD commented 9 months ago

Hey @raeshrode , that's weird - I'll look into it! In the meantime, can you post the error from your computer + all the versions for your packages (output of conda list)?

Regarding the KEGG FTP, those files can be accessed here, but unfortunately you need to be a KEGG subscriber to download them :/ which is why we have to download things from KEGG individually

raeshrode commented 9 months ago

Thank you for the quick response @sterrettJD ! Bummer on the KEGG subscription, but thank you for the link to that too.

My AMON environment packages and versions:

My error:

Asynchronous downloading of KEGG records has failed. KEGG parser will try to download data sequentially. This will be slower.
Total urls to download: 1359. Progress will be shown below.
  0%|          | 0/1359 [00:10<?, ?it/s]
Traceback (most recent call last):
  File "/Users/rshrode/miniconda3/lib/python3.8/site-packages/KEGG_parser/downloader.py", line 78, in get_from_kegg_api
    return [parser(raw_record) for raw_record in loop.run_until_complete(kegg_download_manager(loop, list_of_ids))]
  File "/Users/rshrode/miniconda3/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/Users/rshrode/miniconda3/lib/python3.8/site-packages/KEGG_parser/downloader.py", line 47, in kegg_download_manager
    results = await asyncio.gather(*tasks)
  File "/Users/rshrode/miniconda3/lib/python3.8/site-packages/KEGG_parser/downloader.py", line 38, in download_coroutine
    raise ValueError('KEGG has forbidden request after %s attempts for url %s
    which returns a response status of %s'
ValueError: KEGG has forbidden request after 10 attempts for url http://rest.kegg.jp/get/K19756+K02335+K17597+K10099+K04207+K04000+06547+K26129+K02616+K25102
which returns a response status of 403

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/rshrode/miniconda3/bin/amon.py", line 74, in <module>
    main(kos_loc, output_dir, other_kos_loc, detected_compounds, name1, name2, keep_separated, samples_are_columns,
  File "/Users/rshrode/miniconda3/lib/python3.8/site-packages/AMON/predict_metabolites.py", line 283, in main
    ko_dict = get_kegg_record_dict(set(all_kos), parse_ko, ko_file_loc)
  File "/Users/rshrode/miniconda3/lib/python3.8/site-packages/KEGG_parser/downloader.py", line 88, in get_kegg_record_dict
    records = get_from_kegg_api(loop, list_of_ids, parser)
  File "/Users/rshrode/miniconda3/lib/python3.8/site-packages/KEGG_parser/downloader.py", line 83, in get_from_kegg_api
    return [parser(raw_record) for raw_record in kegg_download_manager_synchronous(list_of_ids)]
    results.append(download_synchronous(url))
  File "/Users/rshrode/miniconda3/lib/python3.8/site-packages/KEGG_parser/downloader.py", line 59, in download_synchronous
    raise ValueError("KEGG has forbidden request after %s attempts for url %s
    which returns a response status of %s
ValueError: KEGG has forbidden request after 10 attempts for url htto://rest.kega.ip/aet/K14648+K13903+K06473+K07522+K09456+K12076+K23204+K06910+K04999+K09203
which returns a response status of 403

Thank you!

sterrettJD commented 8 months ago

Hey @raeshrode , it looks like KEGG_parser is requesting a weird url... In that last line,

htto://rest.kega.ip/aet/K14648+K13903+K06473+K07522+K09456+K12076+K23204+K06910+K04999+K09203 should be http://rest.kegg.jp/get/K14648+K13903+K06473+K07522+K09456+K12076+K23204+K06910+K04999+K09203.

(htto -> http; kega -> kegg; ip -> jp; aet -> get)

I haven't seen this before, and I'm not sure how this string is getting corrupted. Would you be able to email me the command/input data you're using for AMON (john.sterrett@colorado.edu)? I can see if I get the same error on my end.

I could be wrong, but I think that this may be a different issue from what Lucas was dealing with. In this case, AMON is attempting to download the KEGG data in parallel, then when that fails, it's attempting to download the data not in parallel. Lucas's error was due to hitting limits in the number of requests allowed per minute by KEGG, but this error seems to be related to some corruption of the URL string requested...
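
To compare a logged URL against what a well-formed request should look like, a small sketch like the following can rebuild the expected URLs from a list of KO ids. It assumes the http://rest.kegg.jp/get/ prefix and the ten-ids-per-request grouping seen in the traceback, and it rejects anything that is not "K" followed by five digits. The function name is hypothetical and is not part of KEGG_parser.

```python
import re

KEGG_GET_PREFIX = "http://rest.kegg.jp/get/"  # prefix as it appears in the traceback
KO_PATTERN = re.compile(r"K\d{5}")  # KO ids: "K" followed by five digits


def build_get_urls(ko_ids, per_request=10):
    """Group KO ids into '+'-joined KEGG get URLs, per_request ids per URL."""
    bad = [ko for ko in ko_ids if not KO_PATTERN.fullmatch(ko)]
    if bad:
        # Report malformed ids up front instead of sending them to the server.
        raise ValueError(f"Not valid KO ids: {bad}")
    return [
        KEGG_GET_PREFIX + "+".join(ko_ids[i:i + per_request])
        for i in range(0, len(ko_ids), per_request)
    ]
```

Running the input KO list through a check like this is also a cheap way to spot unrecognized or malformed annotations before any download starts.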

sterrettJD commented 8 months ago

I tested with @raeshrode's data and got the 403 error, but no weird URL. I think KEGG now "forbids" requests for longer once a requester is "banned"... That means the strategy of attempting a parallel download and then falling back to a non-parallel download no longer works: once the parallel attempt fails, users are still forbidden from making KEGG requests for something like 30(?) minutes, so the non-parallel attempt fails too.

Anyway, I've updated KEGG_parser to have an option to not try the parallel downloading that seems to be causing the issue, and I've changed the default behavior of AMON to skip the parallel download attempt. Parallel downloading in AMON can be re-enabled using --download_kegg_async, but I'd recommend against that for now. This unfortunately makes things much slower :( (it'll probably take 60-90 minutes for the downloads)

Rachel, can you try updating AMON -> v1.0.1 and kegg_parser -> v0.0.8, and see if that fixes things? It does on my end (with your data). There's an error downstream when calculating enrichment, but that may be because you're only using one species for the microbial side. I'm hoping/assuming that'll go away once you add more taxa into the mix.
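
As an illustration of the sequential-only approach described above, here is a hedged sketch of a paced downloader that stops at the first 403 instead of retrying, since retrying while forbidden appears to achieve nothing. This is not the KEGG_parser implementation: the function names, the one-second delay and the use of urllib are all assumptions.

```python
import time
import urllib.error
import urllib.request


def fetch_text(url, timeout=30):
    """Fetch one URL and decode it leniently."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # errors="replace" avoids a UnicodeDecodeError on stray bytes such as 0xa0.
        return response.read().decode("utf-8", errors="replace")


def download_sequentially(urls, fetch=fetch_text, delay=1.0, sleep=time.sleep):
    """Download urls one at a time, pausing delay seconds between requests.

    The delay value is a guess; KEGG does not publish its rate limit.
    """
    results = []
    for i, url in enumerate(urls):
        if i:
            sleep(delay)  # pause before every request except the first
        try:
            results.append(fetch(url))
        except urllib.error.HTTPError as err:
            if err.code == 403:
                # Stop immediately rather than hammering a server that has banned us.
                raise RuntimeError(
                    f"Forbidden after {len(results)} of {len(urls)} downloads; "
                    "wait before retrying"
                ) from err
            raise
    return results
```

The fetch and sleep parameters are injectable so the pacing logic can be exercised without network access or real waiting.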