scrtlabs / catalyst

An Algorithmic Trading Library for Crypto-Assets in Python
http://enigma.co
Apache License 2.0

[question] data availability for backtests #197

Closed gatapia closed 6 years ago

gatapia commented 6 years ago

I am working on a nightly 'refresh' process to update data, re-train models, re-run backtests, etc. But in doing this I realized that data for the last 15 days or so is not available. I checked the status page and it appears data is only available until the 15th of Jan.

Is there any information anywhere on how this will work? Has data on the Catalyst servers stopped being updated? Is it a periodic refresh? If so, how often can we expect data to be updated? Is this part of the move towards the Enigma platform, and if so, how do we access that data source?

Any info appreciated (sorry for the million questions, was just a bit of a surprise).
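
For context, the refresh step in my nightly process is essentially a loop over the ingest CLI; a minimal sketch (the exchange and pairs are illustrative, not my full universe):

import subprocess

# Nightly refresh sketch: re-ingest each pair before re-training models
# and re-running backtests. EXCHANGE and PAIRS are placeholders.
EXCHANGE = 'bitfinex'
PAIRS = ['btc_usd', 'eth_usd']

for pair in PAIRS:
    subprocess.check_call(
        ['catalyst', 'ingest-exchange', '-x', EXCHANGE, '-f', 'daily', '-i', pair])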

fredfortier commented 6 years ago

It should be a daily refresh. Thank you for reporting this. We will let you know as soon as possible.

andrei-nutas commented 6 years ago

I was about to write regarding the same problem. I was running a backtest on XRP from Bitfinex, and it seems that it stopped generating outcomes after 2017-12-04. But if I run the backtest only up to 2017-12-04, it works fine.

lacabra commented 6 years ago

Thank you both for reporting this. We consider this a critical bug in the catalyst infrastructure (it's not related to the code repository itself), and I am actively working to troubleshoot it. I believe it's a matter of hours.

andrei-nutas commented 6 years ago

Thanks, appreciate the fast reaction.

gatapia commented 6 years ago

I just updated btc_usd on Bitfinex and now have data up to 2018-01-28 00:00:00. Is this correct (i.e., three days old)?

gatapia commented 6 years ago

Just did a full clean/refresh again, and now we have data up to 2018-01-29 00:00:00. I guess you are still updating your caches? I'll try again in an hour or so and report back.

sekamaneka commented 6 years ago

Any progress on this? I don't seem to be getting any data after Jan 15, as shown on the status page posted above.

briannewtonpsyd commented 6 years ago

@gatapia How did you get data past 1-15-18? That's all I'm able to pull down. What are you doing to fully clean and refresh, and how do you check what the latest data that's been pulled down is?

fredfortier commented 6 years ago

From what I understand, Bitfinex suddenly tightened its rate limits, which interferes with the download script. We've been adjusting parameters to work around it, but evidently it's not resolved. We'll provide a more coherent update on Monday.
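
(The download script itself isn't in this repo, but the workaround amounts to retrying each request with exponential backoff; a generic sketch, not our actual code:)

import time

def fetch_with_backoff(fetch_page, max_retries=5, base_delay=1.0):
    # fetch_page is a placeholder for a single paginated request to the exchange.
    for attempt in range(max_retries):
        try:
            return fetch_page()
        except IOError:  # e.g. an HTTP 429 "too many requests" response
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError('exchange kept rate-limiting the request')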


gatapia commented 6 years ago

@brinew27 I was able to get more data by querying ExchangeBundle.get_reader directly after doing catalyst clean-exchange; catalyst ingest-exchange. However, when running backtests, data is only available until Jan 15.
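
To answer the second question: the simplest check I've found is a throwaway backtest that prints the newest index entry from data.history, roughly like this (the pair and dates are just examples):

import pandas as pd
from catalyst import run_algorithm
from catalyst.api import symbol

def initialize(context):
    context.asset = symbol('btc_usd')

def handle_data(context, data):
    # The newest timestamp in the returned window is the latest ingested bar.
    df = data.history(context.asset, 'close', bar_count=5, frequency='1d')
    print('latest bar:', df.index[-1])
    exit(0)

run_algorithm(capital_base=1000, data_frequency='daily',
              initialize=initialize, handle_data=handle_data,
              exchange_name='bitfinex', algo_namespace='check-latest-bar',
              base_currency='usd', live=False,
              start=pd.to_datetime('2018-01-10', utc=True),
              end=pd.to_datetime('2018-01-15', utc=True))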

briannewtonpsyd commented 6 years ago

Thanks @gatapia.

@fredfortier @lacabra, thank you both for looking into this. If it's just Bitfinex throttling, why would it be affecting Poloniex as well (where I'm trading)? Along with the live trading issue (which Fred has been working hard on), this has been somewhat of a show-stopper for me: I'm using historical data to train a machine-learning algorithm to make trades, and now the data I'm training it with is about three weeks old, so its performance is degrading pretty rapidly.

Ingesting Binance data is my next request (though not a show-stopper) for ideal functioning, since training my ML model on Poloniex's data and trying to apply it to Binance (where ideally I'd be trading, due to much lower commission rates) has definitely led to some issues. =)

lacabra commented 6 years ago

Apologies for the delay; we have been focusing on the release of the data marketplace (Catalyst 0.5), which was announced and released today, and I am now looking into this again. My initial tests show that the historical pricing data for all three exchanges is indeed stored correctly on the servers, and catalyst ingests and retrieves it just fine. A random sample of coins across exchanges supports this observation (I'm running the latest version of Catalyst, 0.5.1 as of this writing).

For some reason the status page at https://www.enigma.co/catalyst/status fails to update the dates, which I am investigating further, but as far as I can see the data is available for ingestion and backtesting.

# Exchange Poloniex
df = data.history('amp_btc', ['open','high','low','close','volume'], bar_count=30, frequency="1d")
print(df)

                              close      high       low      open      volume
2018-01-09 00:00:00+00:00  0.000072  0.000078  0.000066  0.000069  175.797907
2018-01-10 00:00:00+00:00  0.000069  0.000075  0.000063  0.000072   67.346690
2018-01-11 00:00:00+00:00  0.000063  0.000069  0.000060  0.000069   54.261279
2018-01-12 00:00:00+00:00  0.000070  0.000072  0.000062  0.000063   44.731335
2018-01-13 00:00:00+00:00  0.000069  0.000075  0.000068  0.000070   33.340186
2018-01-14 00:00:00+00:00  0.000068  0.000073  0.000065  0.000069   35.553520
2018-01-15 00:00:00+00:00  0.000063  0.000069  0.000061  0.000068   32.042545
2018-01-16 00:00:00+00:00  0.000049  0.000064  0.000047  0.000063   60.988427
2018-01-17 00:00:00+00:00  0.000049  0.000054  0.000042  0.000049   42.038478
2018-01-18 00:00:00+00:00  0.000053  0.000057  0.000048  0.000048   41.736072
2018-01-19 00:00:00+00:00  0.000052  0.000056  0.000051  0.000053   19.262759
2018-01-20 00:00:00+00:00  0.000052  0.000055  0.000050  0.000052   13.913008
2018-01-21 00:00:00+00:00  0.000047  0.000052  0.000046  0.000052   27.261551
2018-01-22 00:00:00+00:00  0.000047  0.000052  0.000046  0.000048   19.676022
2018-01-23 00:00:00+00:00  0.000048  0.000049  0.000046  0.000047   11.005895
2018-01-24 00:00:00+00:00  0.000052  0.000053  0.000046  0.000048   20.873229
2018-01-25 00:00:00+00:00  0.000049  0.000053  0.000048  0.000052   23.807841
2018-01-26 00:00:00+00:00  0.000047  0.000049  0.000046  0.000049   13.002861
2018-01-27 00:00:00+00:00  0.000052  0.000055  0.000046  0.000047   30.126841
2018-01-28 00:00:00+00:00  0.000050  0.000052  0.000048  0.000052   17.266434
2018-01-29 00:00:00+00:00  0.000048  0.000050  0.000048  0.000050   17.920394
2018-01-30 00:00:00+00:00  0.000047  0.000051  0.000047  0.000048   24.585311
2018-01-31 00:00:00+00:00  0.000045  0.000047  0.000044  0.000047   14.981092
2018-02-01 00:00:00+00:00  0.000045  0.000053  0.000044  0.000045   44.027583
2018-02-02 00:00:00+00:00  0.000041  0.000045  0.000038  0.000045   28.419236
2018-02-03 00:00:00+00:00  0.000044  0.000044  0.000039  0.000041   14.287807
2018-02-04 00:00:00+00:00  0.000039  0.000044  0.000037  0.000044   13.139417
2018-02-05 00:00:00+00:00  0.000036  0.000039  0.000033  0.000039   23.214927
2018-02-06 00:00:00+00:00  0.000037  0.000039  0.000036  0.000036    7.809977
2018-02-07 00:00:00+00:00  0.000040  0.000041  0.000037  0.000037   15.566479
# Exchange Bittrex
df = data.history('1st_btc', ['open','high','low','close','volume'], bar_count=30, frequency="1d")
print(df)

                              close      high       low      open  \
2018-01-09 00:00:00+00:00  0.000139  0.000160  0.000130  0.000136   
2018-01-10 00:00:00+00:00  0.000116  0.000140  0.000115  0.000139   
2018-01-11 00:00:00+00:00  0.000107  0.000123  0.000097  0.000116   
2018-01-12 00:00:00+00:00  0.000115  0.000118  0.000099  0.000107   
2018-01-13 00:00:00+00:00  0.000125  0.000135  0.000105  0.000115   
2018-01-14 00:00:00+00:00  0.000112  0.000127  0.000111  0.000125   
2018-01-15 00:00:00+00:00  0.000105  0.000121  0.000105  0.000112   
2018-01-16 00:00:00+00:00  0.000082  0.000105  0.000075  0.000105   
2018-01-17 00:00:00+00:00  0.000084  0.000088  0.000069  0.000082   
2018-01-18 00:00:00+00:00  0.000093  0.000101  0.000082  0.000084   
2018-01-19 00:00:00+00:00  0.000091  0.000095  0.000090  0.000091   
2018-01-20 00:00:00+00:00  0.000084  0.000095  0.000083  0.000091   
2018-01-21 00:00:00+00:00  0.000079  0.000086  0.000075  0.000086   
2018-01-22 00:00:00+00:00  0.000076  0.000087  0.000074  0.000079   
2018-01-23 00:00:00+00:00  0.000074  0.000083  0.000073  0.000076   
2018-01-24 00:00:00+00:00  0.000077  0.000080  0.000074  0.000075   
2018-01-25 00:00:00+00:00  0.000078  0.000080  0.000075  0.000076   
2018-01-26 00:00:00+00:00  0.000086  0.000087  0.000075  0.000078   
2018-01-27 00:00:00+00:00  0.000082  0.000086  0.000079  0.000086   
2018-01-28 00:00:00+00:00  0.000093  0.000103  0.000078  0.000081   
2018-01-29 00:00:00+00:00  0.000083  0.000093  0.000082  0.000093   
2018-01-30 00:00:00+00:00  0.000070  0.000091  0.000065  0.000083   
2018-01-31 00:00:00+00:00  0.000070  0.000074  0.000066  0.000070   
2018-02-01 00:00:00+00:00  0.000067  0.000074  0.000063  0.000070   
2018-02-02 00:00:00+00:00  0.000064  0.000068  0.000057  0.000068   
2018-02-03 00:00:00+00:00  0.000065  0.000067  0.000063  0.000064   
2018-02-04 00:00:00+00:00  0.000064  0.000068  0.000063  0.000065   
2018-02-05 00:00:00+00:00  0.000061  0.000072  0.000057  0.000063   
2018-02-06 00:00:00+00:00  0.000059  0.000061  0.000052  0.000060   
2018-02-07 00:00:00+00:00  0.000058  0.000061  0.000056  0.000059  
# Exchange Bitfinex
df = data.history('avt_btc', ['open','high','low','close','volume'], bar_count=30, frequency="1d")
print(df)

                              close      high       low      open  \
2018-01-09 00:00:00+00:00  0.000402  0.000500  0.000375  0.000382   
2018-01-10 00:00:00+00:00  0.000380  0.000663  0.000346  0.000402   
2018-01-11 00:00:00+00:00  0.000405  0.000450  0.000318  0.000385   
2018-01-12 00:00:00+00:00  0.000399  0.000440  0.000392  0.000404   
2018-01-13 00:00:00+00:00  0.000384  0.000435  0.000376  0.000400   
2018-01-14 00:00:00+00:00  0.000362  0.000382  0.000339  0.000381   
2018-01-15 00:00:00+00:00  0.000339  0.000368  0.000330  0.000362   
2018-01-16 00:00:00+00:00  0.000251  0.000335  0.000211  0.000332   
2018-01-17 00:00:00+00:00  0.000293  0.000310  0.000231  0.000251   
2018-01-18 00:00:00+00:00  0.000266  0.000325  0.000266  0.000310   
2018-01-19 00:00:00+00:00  0.000287  0.000292  0.000260  0.000263   
2018-01-20 00:00:00+00:00  0.000325  0.000338  0.000268  0.000280   
2018-01-21 00:00:00+00:00  0.000285  0.000320  0.000280  0.000315   
2018-01-22 00:00:00+00:00  0.000297  0.000338  0.000280  0.000293   
2018-01-23 00:00:00+00:00  0.000327  0.000363  0.000291  0.000314   
2018-01-24 00:00:00+00:00  0.000329  0.000365  0.000320  0.000329   
2018-01-25 00:00:00+00:00  0.000310  0.000333  0.000301  0.000327   
2018-01-26 00:00:00+00:00  0.000366  0.000372  0.000300  0.000310   
2018-01-27 00:00:00+00:00  0.000388  0.000430  0.000352  0.000372   
2018-01-28 00:00:00+00:00  0.000348  0.000390  0.000339  0.000385   
2018-01-29 00:00:00+00:00  0.000350  0.000359  0.000326  0.000350   
2018-01-30 00:00:00+00:00  0.000305  0.000352  0.000290  0.000350   
2018-01-31 00:00:00+00:00  0.000298  0.000316  0.000291  0.000305   
2018-02-01 00:00:00+00:00  0.000295  0.000306  0.000283  0.000300   
2018-02-02 00:00:00+00:00  0.000279  0.000301  0.000270  0.000297   
2018-02-03 00:00:00+00:00  0.000329  0.000330  0.000275  0.000279   
2018-02-04 00:00:00+00:00  0.000311  0.000340  0.000303  0.000330   
2018-02-05 00:00:00+00:00  0.000298  0.000329  0.000280  0.000322   
2018-02-06 00:00:00+00:00  0.000304  0.000339  0.000280  0.000298   
2018-02-07 00:00:00+00:00  0.000294  0.000334  0.000287  0.000307
lacabra commented 6 years ago

Found the culprit and fixed it. As far as I can see right now, there was a small error in the scripts that pulled the end_dates from the server, but the data had been on the server all this time (or most of it). Please let me know if anyone is still experiencing issues with this; otherwise I'll close this issue.

zackgow commented 6 years ago

@lacabra I can query some data past the 15th now, but depending on the date range, I still get errors such as:

NoDataAvailableOnExchange: Requested data for trading pair [u'eth_usd'] is not available on exchange ['bitfinex'] in minute frequency at this time. Check http://enigma.co/catalyst/status for market coverage.

lacabra commented 6 years ago

@zackgow please give me example date ranges where you encounter such an error so that I can track it down.

zackgow commented 6 years ago

@lacabra start_date='2018-01-29', end_date='2018-01-30' throws the error for me. I have upgraded to 0.5.1.

lacabra commented 6 years ago

Thanks @zackgow, I confirm that I can replicate your error and that there was indeed some missing data in that bundle. That particular bundle has been fixed. I'm looking into other minute bundles for that same month, Jan 2018. In order for the new data to come in, you may need to run catalyst ingest-exchange -x bitfinex -f minute -i eth_usd again; if that doesn't fix it, first run catalyst clean-exchange -x bitfinex and then re-ingest.

I uncovered another error: something got out of sync between the catalyst client and the server that generates the bundles. It has been fixed, and I confirm that data for that particular bundle is now available. I'm regenerating the data for all other bundles that are missing the last two days of January, and will post an update when it's complete. Here's the code that validates that data is available for that particular bundle:

df = data.history('eth_btc', ['open','high','low','close','volume'], bar_count=10, frequency="1m")
print(df)
                             close     high      low     open     volume
2018-01-29 23:51:00+00:00  0.10448  0.10448  0.10446  0.10447   5.894544
2018-01-29 23:52:00+00:00  0.10447  0.10449  0.10447  0.10449   4.820322
2018-01-29 23:53:00+00:00  0.10439  0.10447  0.10438  0.10447   8.559114
2018-01-29 23:54:00+00:00  0.10441  0.10444  0.10431  0.10444   7.389059
2018-01-29 23:55:00+00:00  0.10434  0.10446  0.10430  0.10446  17.463309
2018-01-29 23:56:00+00:00  0.10444  0.10444  0.10431  0.10434  31.146378
2018-01-29 23:57:00+00:00  0.10444  0.10448  0.10444  0.10448   1.687714
2018-01-29 23:58:00+00:00  0.10443  0.10444  0.10443  0.10444   4.050000
2018-01-29 23:59:00+00:00  0.10460  0.10465  0.10446  0.10446  38.051553
2018-01-30 00:00:00+00:00  0.10439  0.10469  0.10439  0.10460  38.968829
briannewtonpsyd commented 6 years ago

@lacabra I can confirm I am getting missing-data errors for this period (beginning Jan 29th) for Poloniex minute eth_usdt as well. I'll wait for an update before testing again to confirm.

EDIT: To report back, I am no longer getting errors when accessing data from Jan 16th through Feb 7th, but all the data is essentially frozen, with 0 volume and the price never changing from 1245 (eth_usdt minute data on Poloniex). Also, oddly, the data seems to stop at Feb 7th at 23:45 instead of midnight, though that's a minor issue.

lacabra commented 6 years ago

@zackgow, @brinew27 All the minute bundles (bitfinex and poloniex) for the month of January have been regenerated. My random sampling of bundles indicates that the issue has been resolved. You will need to clear all locally-stored bundles with catalyst clean-exchange -x poloniex and then re-ingest the data that you need. If you continue experiencing problems, please detail which exchange, currency pair, and interval so that I can dig into it again.

Random sampling yields:

# exchange Poloniex
df = data.history('eth_usdt', ['open','high','low','close','volume'], bar_count=10, frequency="1m")
print(df)

                                 close         high          low         open  \
2018-01-29 23:51:00+00:00  1177.839733  1177.839733  1174.164000  1175.064042   
2018-01-29 23:52:00+00:00  1174.164001  1174.164001  1174.164001  1174.164001   
2018-01-29 23:53:00+00:00  1174.164001  1174.164001  1174.164001  1174.164001   
2018-01-29 23:54:00+00:00  1175.000000  1177.014043  1175.000000  1175.000000   
2018-01-29 23:55:00+00:00  1177.014043  1177.014043  1177.014043  1177.014043   
2018-01-29 23:56:00+00:00  1177.000001  1177.000001  1177.000000  1177.000000   
2018-01-29 23:57:00+00:00  1174.914043  1177.000000  1174.914043  1177.000000   
2018-01-29 23:58:00+00:00  1177.839733  1177.839733  1177.839733  1177.839733   
2018-01-29 23:59:00+00:00  1177.839733  1177.839733  1177.839733  1177.839733   
2018-01-30 00:00:00+00:00  1177.948026  1177.948026  1177.839733  1177.839733   

                                volume  
2018-01-29 23:51:00+00:00   376.050737  
2018-01-29 23:52:00+00:00  1016.055586  
2018-01-29 23:53:00+00:00     0.000000  
2018-01-29 23:54:00+00:00  1326.173051  
2018-01-29 23:55:00+00:00    40.927132  
2018-01-29 23:56:00+00:00   283.607507  
2018-01-29 23:57:00+00:00  9341.204346  
2018-01-29 23:58:00+00:00   626.995668  
2018-01-29 23:59:00+00:00     0.000000  
2018-01-30 00:00:00+00:00  5416.359877  
# exchange Bitfinex
df = data.history('eos_btc', ['open','high','low','close','volume'], bar_count=10, frequency="1m")
print(df)

                              close      high       low      open      volume
2018-01-29 23:51:00+00:00  0.001204  0.001204  0.001204  0.001204    0.000000
2018-01-29 23:52:00+00:00  0.001204  0.001204  0.001204  0.001204   38.498501
2018-01-29 23:53:00+00:00  0.001204  0.001204  0.001204  0.001204   20.085249
2018-01-29 23:54:00+00:00  0.001203  0.001204  0.001203  0.001204   13.141708
2018-01-29 23:55:00+00:00  0.001204  0.001204  0.001204  0.001204  221.953474
2018-01-29 23:56:00+00:00  0.001205  0.001206  0.001204  0.001204  272.401738
2018-01-29 23:57:00+00:00  0.001206  0.001206  0.001206  0.001206    9.630859
2018-01-29 23:58:00+00:00  0.001206  0.001206  0.001206  0.001206    7.637040
2018-01-29 23:59:00+00:00  0.001206  0.001208  0.001206  0.001207  392.917572
2018-01-30 00:00:00+00:00  0.001206  0.001206  0.001206  0.001206    0.000000
briannewtonpsyd commented 6 years ago

@lacabra Thanks Victor! It does seem like it generated data for the missing days. However, I'm getting a weird error at a certain part of the data for eth_usdt minute on Poloniex, apparently on 2017-12-07. I was able to export 2018-01-15 through 2018-01-31 without issue. For right now I'll just skip this day, but wanted to report it in case there's an issue with the data at that point (and ideally I'd like my training data to extend back to there).

[2018-02-09 07:53:37.884000] INFO: symbol_export: 2017-12-07 20:56:00+00:00
Traceback (most recent call last):
  File "Z:/Users/Brian/Google Drive/Catalyst/symbol_export.py", line 185, in <module>
    capital_base=100
  File "C:\Users\brian\Anaconda2\envs\catalyst\lib\site-packages\catalyst\utils\run_algo.py", line 551, in run_algorithm
    stats_output=stats_output
  File "C:\Users\brian\Anaconda2\envs\catalyst\lib\site-packages\catalyst\utils\run_algo.py", line 330, in _run
    overwrite_sim_params=False,
  File "C:\Users\brian\Anaconda2\envs\catalyst\lib\site-packages\catalyst\exchange\exchange_algorithm.py", line 352, in run
    data, overwrite_sim_params
  File "C:\Users\brian\Anaconda2\envs\catalyst\lib\site-packages\catalyst\exchange\exchange_algorithm.py", line 309, in run
    data, overwrite_sim_params
  File "C:\Users\brian\Anaconda2\envs\catalyst\lib\site-packages\catalyst\algorithm.py", line 724, in run
    for perf in self.get_generator():
  File "C:\Users\brian\Anaconda2\envs\catalyst\lib\site-packages\catalyst\gens\tradesimulation.py", line 243, in transform
    self._get_minute_message(dt, algo, algo.perf_tracker)
  File "C:\Users\brian\Anaconda2\envs\catalyst\lib\site-packages\catalyst\gens\tradesimulation.py", line 303, in _get_minute_message
    dt, self.data_portal,
  File "C:\Users\brian\Anaconda2\envs\catalyst\lib\site-packages\catalyst\finance\performance\tracker.py", line 357, in handle_minute_close
    account.leverage)
  File "C:\Users\brian\Anaconda2\envs\catalyst\lib\site-packages\catalyst\finance\risk\cumulative.py", line 219, in update
    self.mean_benchmark_returns_cont[dt_loc] * 252
RuntimeWarning: overflow encountered in double_scalars
lacabra commented 6 years ago

Hi @brinew27, glad to hear that the data integrity issue is resolved. What you describe above seems to be a different error, for which I will open a separate issue and assign it to someone else so that we can track it properly. I have checked the data from that day (exported it to a csv and plotted it), and it looks alright to me, other than the fact that there is zero volume for some minutes surrounded by very high or normal activity, but those may be glitches on the exchange: https://docs.google.com/spreadsheets/d/1GYWKoJHBv9W56pdKmf8eXWkOG6WaBvDsg2ZOWWvv2SY/edit?usp=sharing
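
(For reference, the export is just data.history plus DataFrame.to_csv inside handle_data; a minimal sketch, assuming context.asset is set to the pair of interest in initialize, with an illustrative filename:)

def handle_data(context, data):
    # Pull one full day of minute bars and dump them for offline plotting.
    df = data.history(context.asset,
                      ['open', 'high', 'low', 'close', 'volume'],
                      bar_count=1440, frequency='1m')
    df.to_csv('eth_usdt_minute_2017-12-07.csv')
    exit(0)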

anthony-dipofi commented 6 years ago

Hello @lacabra, I am having similar issues where some of the historical data after Jan. 16 is seemingly frozen with 0 volume. In particular I have seen this happening with data from poloniex on the pairs LTC_BTC, XRP_BTC, STR_BTC, SC_BTC, XMR_BTC during Jan. 28 - 29. I have tried running catalyst clean-exchange -x poloniex and reingesting the data. Any help would be appreciated, thanks.

briannewtonpsyd commented 6 years ago

@lacabra Hey Victor, I seem to still be having some missing data issues on some pairs on Poloniex. I've cleaned all bundles and re-ingested, and am getting missing data errors from 1-29 to 1-30 for xrp_usdt on Poloniex:

[2018-02-10 07:08:11.313000] INFO: symbol_export: 2018-01-29 23:45:00+00:00
[2018-02-10 07:08:11.339000] INFO: symbol_export: 2018-01-29 23:46:00+00:00
Traceback (most recent call last):
  File "Z:/Users/Brian/Google Drive/Catalyst/symbol_export.py", line 185, in <module>
    capital_base=100
  File "C:\Users\brian\Anaconda2\envs\catalyst\lib\site-packages\catalyst\utils\run_algo.py", line 551, in run_algorithm
    stats_output=stats_output
  File "C:\Users\brian\Anaconda2\envs\catalyst\lib\site-packages\catalyst\utils\run_algo.py", line 330, in _run
    overwrite_sim_params=False,
  File "C:\Users\brian\Anaconda2\envs\catalyst\lib\site-packages\catalyst\exchange\exchange_algorithm.py", line 352, in run
    data, overwrite_sim_params
  File "C:\Users\brian\Anaconda2\envs\catalyst\lib\site-packages\catalyst\exchange\exchange_algorithm.py", line 309, in run
    data, overwrite_sim_params
  File "C:\Users\brian\Anaconda2\envs\catalyst\lib\site-packages\catalyst\algorithm.py", line 724, in run
    for perf in self.get_generator():
  File "C:\Users\brian\Anaconda2\envs\catalyst\lib\site-packages\catalyst\gens\tradesimulation.py", line 224, in transform
    for capital_change_packet in every_bar(dt):
  File "C:\Users\brian\Anaconda2\envs\catalyst\lib\site-packages\catalyst\gens\tradesimulation.py", line 137, in every_bar
    handle_data(algo, current_data, dt_to_use)
  File "C:\Users\brian\Anaconda2\envs\catalyst\lib\site-packages\catalyst\utils\events.py", line 216, in handle_data
    dt,
  File "C:\Users\brian\Anaconda2\envs\catalyst\lib\site-packages\catalyst\utils\events.py", line 235, in handle_data
    self.callback(context, data)
  File "C:\Users\brian\Anaconda2\envs\catalyst\lib\site-packages\catalyst\exchange\exchange_algorithm.py", line 330, in handle_data
    super(ExchangeTradingAlgorithmBacktest, self).handle_data(data)
  File "C:\Users\brian\Anaconda2\envs\catalyst\lib\site-packages\catalyst\algorithm.py", line 473, in handle_data
    self._handle_data(self, data)
  File "Z:/Users/Brian/Google Drive/Catalyst/symbol_export.py", line 84, in handle_data
    frequency=context.CANDLE_SIZE
  File "catalyst\_protocol.pyx", line 120, in catalyst._protocol.check_parameters.__call__.assert_keywords_and_call
  File "catalyst\_protocol.pyx", line 679, in catalyst._protocol.BarData.history
  File "C:\Users\brian\Anaconda2\envs\catalyst\lib\site-packages\catalyst\exchange\exchange_data_portal.py", line 95, in get_history_window
    ffill))
  File "C:\Users\brian\Anaconda2\envs\catalyst\lib\site-packages\redo\__init__.py", line 162, in retry
    return action(*args, **kwargs)
  File "C:\Users\brian\Anaconda2\envs\catalyst\lib\site-packages\catalyst\exchange\exchange_data_portal.py", line 69, in _get_history_window
    ffill)
  File "C:\Users\brian\Anaconda2\envs\catalyst\lib\site-packages\catalyst\exchange\exchange_data_portal.py", line 313, in get_exchange_history_window
    trailing_bar_count=trailing_bar_count,
  File "C:\Users\brian\Anaconda2\envs\catalyst\lib\site-packages\catalyst\exchange\exchange_bundle.py", line 901, in get_history_window_series_and_load
    trailing_bar_count=trailing_bar_count,
  File "C:\Users\brian\Anaconda2\envs\catalyst\lib\site-packages\catalyst\exchange\exchange_bundle.py", line 1014, in get_history_window_series
    end_dt=end_dt
catalyst.exchange.exchange_errors.PricingDataNotLoadedError: Missing data for poloniex xrp_usdt in date range [2018-01-29 11:31:00+00:00 - 2018-01-30 00:00:00+00:00]
Please run: `catalyst ingest-exchange -x poloniex -f minute -i xrp_usdt`. See catalyst documentation for details.
lacabra commented 6 years ago

@goolulusaurs, @brinew27 yes I confirm that what you are experiencing is all related. I will look into it later today or tomorrow. I have re-opened this issue.

gatapia commented 6 years ago

moving from #190:

I have done a clean-exchange and fresh ingest (with no errors) and I have missing data for ltc_usdt pair on poloniex.

I checked the data ingested (dumped to csv) and there is data missing from the 19th of Jan to the 1st of Feb for ltc_usdt. I have not checked other coins.

zackgow commented 6 years ago

There seems to still be missing data in Bitfinex. I tried the btc_usd pair from 2/04/18 onwards.

lacabra commented 6 years ago

I am working on this issue today, and I confirm that, for example, the pair @brinew27 mentioned is not only missing data on 1/29-1/30, but is also flat after 1/19, as @gatapia and @goolulusaurs mention:

[chart: xrp_usdt minute data, Jan 2018, flat after Jan 19]

The pricing data is indeed on the server, and once the bundle is re-generated, it has the correct data:

[chart: xrp_usdt minute data, Jan 2018, after bundle regeneration]

I'm digging into it to understand why this happened and redoing these bundles. Will update soon.

@zackgow your issue seems different; I will look into it next.

lacabra commented 6 years ago

@brinew27, @gatapia, @goolulusaurs, the historical pricing data for Poloniex over the month of January has been fixed. Many markets were flat after Jan 16 or Jan 19; they now hold the correct pricing data. Here's a snapshot of the closing prices for all 99 markets on Poloniex over the month of January (previously you could see many flat lines; not anymore):

[screenshot: closing prices for all 99 Poloniex markets over January 2018]
gatapia commented 6 years ago

@lacabra thanks for the work on this, but there is much more missing data in the Poloniex data than just January. I created a data validation function:

import pandas as pd
from datetime import timedelta

def validate_date_index_integrity(date_index, start='2017-01-01', min_missing_hours=6):
    # Build the full set of hourly timestamps expected between `start` and the last bar.
    hourlies = pd.date_range(start, date_index[-1].floor('1H'), freq='1H')
    missing = hourlies[~hourlies.isin(date_index)]
    if len(missing) == 0:
        return
    ranges_missing = []
    current_start, current_dt = missing[0], missing[0]
    for d in missing:
        exp = current_dt + timedelta(hours=1)
        if d > exp:
            # A gap just ended; record it if it is long enough.
            if (current_dt - current_start).total_seconds() // 3600 >= min_missing_hours:
                ranges_missing.append((current_start, current_dt))
            current_start = d
        current_dt = d
    # Flush the trailing gap, which the loop above never records.
    if (current_dt - current_start).total_seconds() // 3600 >= min_missing_hours:
        ranges_missing.append((current_start, current_dt))
    if ranges_missing:
        raise Exception('found %d missing date ranges greater than %d hours'
                        % (len(ranges_missing), min_missing_hours))
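
I call it on the date index of the ingested minute data, e.g. (df being the ltc_usdt minute frame dumped from the bundle):

validate_date_index_integrity(df.index)  # raises if gaps of 6h or more are found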

If I run this on the date index of the ltc_usdt minute data from Poloniex (starting from 2017-01-01), I get: found 38 missing date ranges greater than 6 hours (where 1/19 - 2/1 was just one of the missing ranges).

After your fix - running a clean-exchange / ingest I get (for the same pair, from 2017-01-01): found 37 missing date ranges greater than 6 hours

So only one missing date range was fixed. The other remaining missing date ranges from 2017-01-01 are:

(Timestamp('2017-01-01 00:00:00'), Timestamp('2017-01-01 10:00:00'))
(Timestamp('2017-01-02 00:00:00'), Timestamp('2017-01-02 12:00:00'))
(Timestamp('2017-01-03 03:00:00'), Timestamp('2017-01-03 10:00:00'))
(Timestamp('2017-01-03 19:00:00'), Timestamp('2017-01-04 03:00:00'))
(Timestamp('2017-01-06 17:00:00'), Timestamp('2017-01-06 23:00:00'))
(Timestamp('2017-01-07 01:00:00'), Timestamp('2017-01-07 08:00:00'))
(Timestamp('2017-01-07 14:00:00'), Timestamp('2017-01-07 21:00:00'))
(Timestamp('2017-01-08 13:00:00'), Timestamp('2017-01-08 21:00:00'))
(Timestamp('2017-01-08 23:00:00'), Timestamp('2017-01-09 16:00:00'))
(Timestamp('2017-01-10 02:00:00'), Timestamp('2017-01-10 17:00:00'))
(Timestamp('2017-01-10 22:00:00'), Timestamp('2017-01-11 07:00:00'))
(Timestamp('2017-01-14 21:00:00'), Timestamp('2017-01-15 03:00:00'))
(Timestamp('2017-01-19 00:00:00'), Timestamp('2017-01-19 07:00:00'))
(Timestamp('2017-01-21 14:00:00'), Timestamp('2017-01-21 21:00:00'))
(Timestamp('2017-01-22 13:00:00'), Timestamp('2017-01-22 19:00:00'))
(Timestamp('2017-01-24 03:00:00'), Timestamp('2017-01-24 10:00:00'))
(Timestamp('2017-01-25 12:00:00'), Timestamp('2017-01-26 04:00:00'))
(Timestamp('2017-01-27 15:00:00'), Timestamp('2017-01-28 06:00:00'))
(Timestamp('2017-02-03 17:00:00'), Timestamp('2017-02-04 03:00:00'))
(Timestamp('2017-02-04 05:00:00'), Timestamp('2017-02-04 16:00:00'))
(Timestamp('2017-02-04 19:00:00'), Timestamp('2017-02-05 02:00:00'))
(Timestamp('2017-02-05 23:00:00'), Timestamp('2017-02-06 05:00:00'))
(Timestamp('2017-02-06 07:00:00'), Timestamp('2017-02-06 16:00:00'))
(Timestamp('2017-03-01 20:00:00'), Timestamp('2017-03-02 02:00:00'))
(Timestamp('2017-03-06 14:00:00'), Timestamp('2017-03-06 23:00:00'))
(Timestamp('2017-03-08 17:00:00'), Timestamp('2017-03-09 08:00:00'))
(Timestamp('2017-03-10 00:00:00'), Timestamp('2017-03-10 09:00:00'))
(Timestamp('2017-03-13 04:00:00'), Timestamp('2017-03-13 10:00:00'))
(Timestamp('2017-03-15 03:00:00'), Timestamp('2017-03-15 10:00:00'))
(Timestamp('2017-03-19 03:00:00'), Timestamp('2017-03-19 09:00:00'))
(Timestamp('2017-03-20 11:00:00'), Timestamp('2017-03-20 18:00:00'))
(Timestamp('2017-03-20 21:00:00'), Timestamp('2017-03-21 03:00:00'))
(Timestamp('2017-03-22 06:00:00'), Timestamp('2017-03-22 13:00:00'))
(Timestamp('2017-03-23 01:00:00'), Timestamp('2017-03-23 08:00:00'))
(Timestamp('2017-03-27 21:00:00'), Timestamp('2017-03-28 03:00:00'))
(Timestamp('2017-05-02 19:00:00'), Timestamp('2017-05-03 01:00:00'))
(Timestamp('2017-10-22 03:00:00'), Timestamp('2017-10-22 09:00:00'))
lacabra commented 6 years ago

Thanks @gatapia for uncovering this.

The issue of missing data in January was all due to the same problem, which has since been fixed. The cases of missing data that you report in your last comment are likely due to other factors. I need to clearly identify the cause of each problem and ensure it is addressed properly, to avoid the same happening in the future. I reported the fix of the January data for those interested in testing against the most recent timeframe.

I'm digging into the ones that you have uncovered next.

lacabra commented 6 years ago

@gatapia I have run your script, and I don't get any missing dates for the year 2017 for ltc_usdt. I have manually checked the last two ranges mentioned in your last post, e.g. (Timestamp('2017-10-22 03:00:00'), Timestamp('2017-10-22 09:00:00')), and they contain data continuously. This is a sample from the first one (just to confirm, this is from Poloniex for ltc_usdt, displaying the close and volume columns):

2017-10-22 03:00:00+00:00  57.711907      0.000000
2017-10-22 03:01:00+00:00  57.769721    958.546083
2017-10-22 03:02:00+00:00  57.769721      0.000000
2017-10-22 03:03:00+00:00  57.769721      0.000000
2017-10-22 03:04:00+00:00  57.769721      0.000000
2017-10-22 03:05:00+00:00  57.771000     26.070034
2017-10-22 03:06:00+00:00  57.771000      0.000000
2017-10-22 03:07:00+00:00  57.771000      0.000000
2017-10-22 03:08:00+00:00  57.771000      0.000000
2017-10-22 03:09:00+00:00  57.771000      0.000000
2017-10-22 03:10:00+00:00  57.800000   8356.435000
2017-10-22 03:11:00+00:00  57.800000     20.233901
2017-10-22 03:12:00+00:00  57.800000     10.404000
2017-10-22 03:13:00+00:00  57.800000      6.936000
2017-10-22 03:14:00+00:00  57.800000   5507.568357
2017-10-22 03:15:00+00:00  57.770000     60.786807
2017-10-22 03:16:00+00:00  57.712000     17.325046
2017-10-22 03:17:00+00:00  57.766907   1762.941079
2017-10-22 03:18:00+00:00  57.977455     86.250915
2017-10-22 03:19:00+00:00  57.822604     49.226639
2017-10-22 03:20:00+00:00  57.977455      0.067254
2017-10-22 03:21:00+00:00  57.883119    279.506006
2017-10-22 03:22:00+00:00  57.881030    289.405150
2017-10-22 03:23:00+00:00  57.881030    242.504358
2017-10-22 03:24:00+00:00  57.900000   5790.000000
2017-10-22 03:25:00+00:00  57.862517    602.112521
2017-10-22 03:26:00+00:00  57.862517      0.000000
2017-10-22 03:27:00+00:00  57.747736   2922.489776
2017-10-22 03:28:00+00:00  57.747736      0.000000
2017-10-22 03:29:00+00:00  57.752155   7744.853209
2017-10-22 03:30:00+00:00  57.752155      0.000000

If we resample the data in the bundle above into 5-minute intervals and compare it with what we can fetch from Poloniex directly (https://poloniex.com/public?command=returnChartData&currencyPair=USDT_LTC&start=1508641200&end=1508662800&period=300), they both match:

                         close        volume
date                                        
2017-10-22 03:00:00  57.769721    958.546083
2017-10-22 03:05:00  57.771000     26.070034
2017-10-22 03:10:00  57.800000  13901.577258
2017-10-22 03:15:00  57.822604   1976.530486
2017-10-22 03:20:00  57.900000   6601.482769
2017-10-22 03:25:00  57.747736  11269.455506
2017-10-22 03:30:00  57.768003    183.734865
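
(The resample above is plain pandas; a sketch, with each 5-minute bucket taking the last close and the summed volume, df being the minute frame shown earlier:)

df5 = df[['close', 'volume']].resample('5T').agg({'close': 'last', 'volume': 'sum'})
print(df5)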

I would be curious to know what you pass as the date_index parameter to your function, to see if we can find the difference. The one thing I notice is that the volume for those exact times is 0, but that is not necessarily an error; it only means that no trades happened in that precise minute.

gatapia commented 6 years ago

Interesting: I am doing a df.dropna(axis=0, how='any') to remove any rows with NaNs.

I can remove this and it will fix my problem, but should these rows return NaN for open/close/high/low? I mean, even with 0 volume, should the open/close/high/low all be the same number for that period (the closing price of the previous period)?

Anyway, this can be closed, as I can fix this myself in my code; I'll leave the decision to you.

Thanks heaps for helping me track this down.

lacabra commented 6 years ago

@zackgow thanks for reporting! You helped me uncover another edge case, in which some of the latest bundled pricing was missing yesterday's data. It has been fixed moving forward, making the historical pricing data more robust 👍 And btc_usd on Bitfinex is up to date, as seen in the top-left corner below (Feb 2018 data until yesterday). The rest is a random sample of markets available on Bitfinex, which all have valid data. If you continue experiencing missing data, please run catalyst clean-exchange -x bitfinex one more time, and ingest again. It should be fixed!

[screenshot: random sample of Bitfinex markets, all with valid data]

Happy backtesting / trading!

lacabra commented 6 years ago

@gatapia quick follow-up: the df.dropna(axis=0, how='any') should have no effect on that data.

Answering your question ("should these rows return NaN for open/close/high/low? I mean, even with 0 volume, should the open/close/high/low all be the same number for that period (the closing price of the previous period)?"):

The answer is yes: rows with zero volume carry forward the last close as open, high, low, and close, so there should not be any NaNs there.

See the sample minimal code below:

import pandas as pd

from catalyst import run_algorithm
from catalyst.api import symbol

def initialize(context):
    context.asset = symbol('ltc_usdt')

def handle_data(context, data):
    df = data.history(context.asset, ['open','high','low','close','volume'], bar_count=1440, frequency="1m")
    dx = df.dropna(axis=0, how='any')
    print(dx.equals(df))
    exit(0)

def analyze(context=None, results=None):
    pass

if __name__ == '__main__':
    run_algorithm(
            capital_base=1000,
            data_frequency='minute',
            initialize=initialize,
            handle_data=handle_data,
            analyze=analyze,
            exchange_name='poloniex',
            algo_namespace='testing-datasets',
            base_currency='usdt',
            live=False,
            start=pd.to_datetime('2017-10-23', utc=True), 
            end=pd.to_datetime('2017-10-23', utc=True),
        )

It runs on 2017-10-23 00:00:00, looks back 1440 minutes (24 hours covering the entire 2017-10-22 day), and fetches all available columns. It then applies your dropna call, storing the result in a separate DataFrame, compares both DataFrames, and prints True, meaning they are exactly the same, i.e. dropna did not drop any rows.

I wonder where you get your dataframe from, or whether you do other manipulations beforehand?

lacabra commented 6 years ago

I feel I have carefully addressed each and every missing-data issue reported on this thread, and I am therefore closing this issue. I acknowledge that there are a few instances (mostly dating from 2015 and early 2016) in which the exchange has no data, and thus there may still be flat lines (but if that's what's on the exchange, then Catalyst's data is as valid as the exchange's). I don't recall observing any of that for 2017 and 2018. Please prove me wrong, and I will gladly dig deeper.

Please re-open this issue, or open a new one, if you experience any inconsistencies with historical data in backtesting.

Cheers

Dan733 commented 6 years ago

I am encountering this error when running a backtest with a fresh ingestion of daily data from Bitfinex. Specifically, ingesting btc_eur does not solve the problem.

catalyst.exchange.exchange_errors.PricingDataNotLoadedError: Missing data for bitfinex btc_eur in date range [2017-05-19 00:00:00+00:00 - 2017-07-01 00:00:00+00:00]
Please run: `catalyst ingest-exchange -x bitfinex -f daily -i btc_eur`. See catalyst documentation for details.