tidy-finance / website

This repository hosts the source code for the website tidy-finance.org
https://tidy-finance.org
Other
82 stars 47 forks source link

Wrong month values for CRSP dummy data #101

Closed mathisdrn closed 4 months ago

mathisdrn commented 5 months ago

Replicating Accessing and Managing Financial Data and WRDS Dummy Data to apply Parametric Portfolio Policies creates an error when computing evaluate_portfolio(weights_crsp).

In evaluate_portfolio() at evaluation_capm line : ValueError: zero-size array to reduction operation maximum which has no identity

It likely comes from a mis-alignement of date between evaluation and factors_ff_monthly.

evaluation is constructed from crsp_monthly which I created following WRDS Dummy Data.

From WRDS Dummy Data :

crsp_monthly_dummy = (stock_panel_monthly
  .assign(
    date=lambda x: x["month"]+pd.offsets.MonthEnd(-1),
    ret=lambda x: np.fmax(np.random.normal(size=len(x)), -1),

I am unsure at this point but it may be x["month"]+pd.offsets.MonthEnd(-1) who is incorrect here. I don't have access to WRDS data to compare with dummy data.

Pandas version : 2.2.2 Database : running fine Full replicate, no change to code, only diff is freshly installed environment.

voigtstefan commented 4 months ago

@christophscheuch: Could it be that @mathisdrn problem stems from reading in the Dummy data file (written within an R session) with a Python session? I just checked on my machine with R, which works. Reading in the dummy data with Python, however, fails in the date column: My column "month" only shows entries from 1970-01-01 03:21:24 until 1970-01-01 05:22:07. I tried to look up your old tidy-intelligence blog post to see how to fix the read_sql_query command but could not find it anymore.

Bottomline: We could show how to build tidy_finance_python.sqlite and tidy_finance_r.sqlite in the blog post to make the files immediately usable (without translating the code from R to Python, I guess).

voigtstefan commented 4 months ago

Changed the issue title from "[BUG] Parametric Portfolio Policies" to "Read in dummy dataset with pandas compromises date column "

mathisdrn commented 4 months ago

Hi @voigtstefan, I don't think the issue comes from reading the .sqlite file. My month column properly range from start_date to end_date and is identical across notebook for generating dummy data, .sqlite file and parametric portfolio policies notebook. Also to be clear, dummy data is generated with a Python notebook not R.

Parametric portfolio policies runs without error by adding :

crsp_monthly['month'] = crsp_monthly['month'] + pd.DateOffset(days=1)

right after loading crsp_monthly in parametric portfolio. Or by modifying crsp_monthly['month'] in dummy date with + pd.offsets.MonthBegin(1) instead of + pd.offsets.EndMonth(-1)

I am not sure if this is fully reproducible but here are my values when following first solution described above :

Output :

Capture d’écran 2024-04-15 à 12 08 22 Capture d’écran 2024-04-15 à 12 08 30
christophscheuch commented 4 months ago

I agree with @mathisdrn about the problem and the solution, but will confirm again on Wednesday.

@voigtstefan you shouldn't read the R dummy data with pandas because R stores integers since 1970 and in the Python version we store strings. In an early version of TFWP we had integer conversion from the R database to the correct date in Python, but omitted it because we didn't want to support cross-language operations at the time.

The blog post you might refer to is this one (I'm not aware of anything else related): https://blog.tidy-intelligence.com/posts/data-storage-comparison/

voigtstefan commented 4 months ago

True, my apologies. I completely misread the prerequisites you described!

christophscheuch commented 4 months ago

Indeed crsp_monthly["month"] and factors_ff_monthly["month"] should be beginning of month, but dummy_months is end of month. So the most simple fix is to have

dummy_months = pd.date_range(start_date, end_date, freq="MS") 

instead of pd.date_range(start_date, end_date, freq="MS")