lauramayol / crypto_signal

59 stars 23 forks source link

Google Trends data inconsistent when the timescale is changed #5

Open marchowardbegins opened 5 years ago

marchowardbegins commented 5 years ago

We are using the PyTrends library to grab Google Trends data. This is an API that simply scrapes the Google Trends website using timeframe and keyword parameters that you can set.

The problem is that, for example, when you compare "BTC USD" and "buy bitcoin" the values returned from Google Trends seem to change. In other words since Google Trends returns an index value (1-100) for each keyword, it appears that the index value changes when you change the timeframe.

Why does the happen and is there a way that we can stabilize this so that we can properly use to backtest?

Here is a chart of this anomaly:

screen shot 2018-10-17 at 12 28 37 1

Tilanthi commented 5 years ago

My guess is that the discrepancies are somehow related to the averaging algorithms that Google uses in presenting the data via the web interface that Marc used for his original article, where changing the look-back period results in radically different data sets, with the discrepancies increasing with longer look-back times - where presumably more averaging in being done to reduce the amount of returned data. Compare for example the offset's between the API (in blue) and Google Trends web interface results (in orange) in the 90 days compared to the 30 days data sets in the plots above (both data sets as plotted above only >display< the same 30 days worth of data. Pulling out the raw data from the API does however seem to look much more like the global trend seen in BTC price action. So that there is not a need to re-invent the wheel, I attach below a simple python script to pull the data from the Google api, and an Excel spreadsheet with the extracted data imported, as well as the graphs as shown above. If anyone else has any views over either whether the analysis or data extraction is incorrect, or has knowledge about whether and how the Google web interface data are averaged, please contribute to the discussion.

daily_google_trends.py.zip

trend_output.xlsm.zip

lauramayol commented 5 years ago

Hey guys, I am not quite understanding the graphs. When I use the same parameters manually vs. API, I get the same results. I do understand that "Interest Over Time" data changes based on the parameters, by design:

"Numbers represent search interest relative to the highest point on the chart for the given region and time."

So unless we can get daily data for entire history (or just raw data), we will encounter this challenge. For now, we know we should not use the individual numbers to compare two dates, but we can use the ratio of the numbers.

For me ideally, it would be great if we could get some raw data, not something that changes depending on the dates you pull. We've also discussed weighting daily based on the weekly values which we can pull for full history.

Please let me know if I am misunderstanding also.

Tilanthi commented 5 years ago

Hi Laura - the original issue came from the fact that the Google web interface only returned what appear to be time averaged data (for example averaged into weekly figures for some of the larger look-back periods) if you tried to extract data further than 90 days into the past, and that I could not understand the basis for how they had done their apparent averaging. The averaged figures on a weekly basis from the web interface seemed to be much lower than might be expected by simply following the market price action (hence the offsets you see on the charts). Please note that the charts as displayed may seem a bit confusing - they correspond to 30 days of price action, but pulled from different look back times - and the difference (i.e. the vertical offset) between the web interface and API figures displayed is the issue. So, The API approach does indeed appear to pull out daily figures going back much further in time compared to that available on the Google Trends web interface, but which have much higher numbers as seen in the raw data from the python script, and which to first order seem to follow the general BTC price action, and just look more consistent with what one might expect (what did Monty Python say - "Always expect the unexpected").

What we need for repeating the analysis that Marc did, but over a larger time span is of course daily data, which as far as I can see is what the API approach gives as compared to the Google Trends web interface recovered data - and are the figures imported into the EXCEL spreadsheet. I will add below this email the raw data output file from the python script demo which the Excel spreadsheet was built with after merging with historical BTC price data.

Thinking about this general problem of what one is trying to achieve a little more over the weekend raises another issue that Marc might want to comment about, and that is choice of the search words. In particular the 'buy bitcoin' search term is what one might expect someone new to the crypto world to use, and in some sense the dual keyword pairs used might be expected to be biassed toward new people seeking to enter crypto. I don't know the proportions of the relative purchasing power from fiat -> BTC between the two populations of new people versus existing crypto holders (if one assumes this is related to what is driving the market), but it may be good to brainstorm around the more general issue of what are the best search keywords to achieve to best address the more general question of what is driving the market price - and so we need to think carefully about the most appropriate search terms to address the specific question being addressed. This might be something for Marc to set up in his to do list - I note that there are some people amongst those who have joined recently who might be experienced in this area who could perhaps be consulted, it's more of a market research type of question to define the sample (in this case via key words) being used as the basis of a model.

trend_output.csv.zip

marchowardbegins commented 5 years ago

Pasting this update from group chat :

Marc Howard: Question: Originally in my article I compared the ratio of "Buy Bitcoin" to "BTC USD" as a way to gauge the people that had buying intent to people that already bought and were checking/validating their purchase. Think of the former as the "FOMO" group and the later as the "FUD" group.

In the model that we're coding now we are seeking to find the keyword combinations that best correlate to BTC price. Here are some other keywords that fit this requirement:

FOMO Group: Buy bitcoin How to buy bitcoin Where to buy bitcoin

FUD Group: BTC USD Bitcoin USD Bitcoin price Price of bitcoin

Running these sets through Google Trends we see a new pattern emerging that shows greater volatility in the FOMO camp.

If we ultimately want to determine which combinations of keywords best determine the direction of BTC price what is the best method to test this? Should we perhaps simply aggregate each group? Should we compare every combination of the above to see which has the most predictive influence (if any)?

Here are the two Google Trends groups I created from the list above. (Note that we can also grab this data programmatically via the PyTrends library in our current GitHub project).

FOMO: https://trends.google.com/trends/explore?date=today%205-y&q=buy%20bitcoin,how%20to%20buy%20bitcoin,where%20to%20buy%20bitcoin

FUD: https://trends.google.com/trends/explore?date=today%205-y&q=bitcoin%20price,BTC%20USD,bitcoin%20USD,price%20of%20bitcoin

Any feedback most welcome! (Thx Laura and @Tilanthi for your help on these.)

lauramayol commented 5 years ago

Hi Tilanthi, I am not sure that we are on the same page on the problem to be solved. Here's how I see it:

From my testing, the API returns the same data as manual pull. Just remember you have to use the same parameters for both. This is because Google Trends does not return raw data. Rather, it returns a "score" based on scale 0-100 that changes when you change your parameters. So naturally, if you change your dates, you will get different results. I looked at your example and code, and it seems to me you are comparing the API (20 days at a time) with different dates from the manual pull. So I don't think the problem is between API or GUI.

Also, I've tested and did not need API to pull more than 90 days of daily data. Here are my steps:

So what do I see as a challenge?

  1. Pulling more than 270 days of data (Google only gives us weekly after that, so what if I want year-over-year comparison for a single day?).
  2. Regular data refreshes. Current design has to truncate/insert a rolling 270 days starting from today every day. Ideally, we would just update current day without having to delete past data. However, as explained above, when the data is pulled separately the numbers do not relate.
    • Note: we've discussed weighting daily based on the weekly values which we can pull for full history. This might solve #1 and #2 from above.
  3. Which keywords to use and how to properly compare them? This is Marc's FOMO vs. FUD group descriptions from above.

I think that we are close to solving, but need to be clear on what we are trying to solve. Let me know your thoughts.

lauramayol commented 5 years ago

I also want to add my thoughts on the FOMO vs. FUD group issue:

For the below example and simplicity purpose, let's pretend FOMO group includes "buy bitcoin" only and FUD includes "BTC USD" and "BTC Price"...

There have been discussions about pulling each term individually and then comparing, but I am not understanding why we should do that. I can pull the charts individually for buy bitcoin, BTC USD, BTC Price, but it would be wrong to say that the peak of 100 at Nov 15,2018 for "BTC USD" is more important than value of 91 for "BTC Price". This is because the scores are calculated based on each individual chart so we cannot just add the two numbers together to come up with 191 for the FUD group on Nov 15.

lauramayol commented 5 years ago

UPDATE I checked the links today and although using the same parameters, I get different data. API still matches the GUI but the fact that Google Trends changes the results day over day is highly concerning!

Ratio of "buy bitcoin"/"BTC Price" have changed too...

lauramayol commented 5 years ago

Attached is an updated file that shows the difference between the EXACT same query on two different days. Please see tab "Compare GUI Nov20 vs Nov27" Google Trends - compare API vs GUI.xlsx

Tilanthi commented 5 years ago

Hi Laura - any chance that you could show an image summarising the difference between queries please - it is much easier to visualise things first? Many thanks

lauramayol commented 5 years ago

Does this help? It is the exact same query downloaded on two different days. Wondering if you are getting differences if you try it too. screen shot 2018-11-27 at 3 20 43 pm

Tilanthi commented 5 years ago

Thanks Laura - however, what I actually meant was a graph - rather like the ones at the top of this page, to show the differences - it is really so much easier to show data like this, even if you also upload the raw data as well! Sorry to be a pain - I realise that you are balancing lots of things at the same time !

lauramayol commented 5 years ago

Sorry yes not much time to put things together, but just my point is that the numbers are different and you can see that from the first 4 lines. I'll ask for some help if anyone wants to take this file and visualize.

Tilanthi commented 5 years ago

Hi Laura - Just looking through your Excel file, the differences you are seeing between the GUI and the API results are really minimal - in most cases a few percent in the relatively few cases that I can see - I would not be overly concerned about this level of reproducibility - if there are trends to be found in the data linking these ratios to some kind of sentiment index, I would expect the causal relationships to be far more obvious than the masking effects that such small errors could influence, which to be honest are mostly due to small number statistics. So, personally I would not be too worried about them at this stage. Anyway, good job !

lauramayol commented 5 years ago

Hi Tilanthi, to be clear, my analysis shows two results pulled on different days from the GUI, not using API at all in my screenshot. So what that means, Google gives me two different results for the same query.

lauramayol commented 5 years ago

Ok so I plotted the change in ratio and to me this looks like a significant change worth investigating. We need to understand this better to determine how best to utilize Google Trends.
screen shot 2018-11-27 at 4 25 33 pm