mccgr / edgar

Code to manage data related to SEC EDGAR

Show projects to Tony and Ben #75

Closed: iangow closed this issue 4 years ago

iangow commented 4 years ago

@bdcallen @dingtq I have tried to organise the issues in this repository (edgar) into projects so we can bring some order to the work here:

https://github.com/mccgr/edgar/projects

The "Prepare EDGAR" project is a bit of a catch-all, but the others should be pretty clear.

My inclination would be to slow down the "directEDGAR" project a little bit just to get a handle on what we can do there. Take a modest number of filings, extract the data from directEDGAR's servers and put them on our server in some form. Then take a look and see whether the results would be worth having for a larger sample.

And perhaps we want to close out some projects (e.g., the CUSIP-CIK one) soon-ish as having too many open projects is problematic.

Part of the problem with the MCCGR server is that we have a lot of data that's a bit half-baked. For example, if I can speak about another repository here, even the StreetEvents data, which is perhaps one of the success stories given how many honours students are using it, needs some care (Yvonne is helping with that).

dingtq commented 4 years ago

@iangow @bdcallen Thank you, Ian. I think this helps a lot with understanding and keeping track of all the issues.

@bdcallen Let's focus on finishing the current issues in these projects and closing them out.

Regarding the "directEDGAR" project, please finish the following, which I think should be easy to complete within an hour or so, and then put all your time and effort into closing out the other projects.

  1. Move get_filing_heading_info.R and write_directEDGAR_csv_files.R to the directEDGAR repository.
  2. Check get_ciks.py and either move it to the directEDGAR repository or remove it, deleting the related tables if necessary. Then close issues #72 and #73.

@iangow

Regarding the directEDGAR project, I agree that we should slow it down. In the meantime, though, I think it's better for me to finish extracting all the pre-processed data from directEDGAR's servers, for the following reasons:

  1. directEDGAR's server is becoming less and less reliable: it goes down more and more frequently.
  2. We already have CSV files for the universe of EDGAR 10-K filings, and there are only six files.
  3. I have already extracted almost 50% of the data. Obtaining the data is pretty simple: a few clicks, some renaming, and so on. So extracting a large sample is not much more work than extracting a small one.
  4. I chatted with Ben, and it seems that we don't have similar data on our server yet.

So is it OK for me to finish extracting all the data, and then we can decide whether to put it all on our server and do some extra cleanup work?

Regarding the "half-baked" issue, maybe @bdcallen and I can focus on completing these open issues, repository by repository, over the coming period? I think organizing issues into projects could also be very helpful.

BTW, do you think it would also help to organize the code in a similar way to what you did for the issues? I have had trouble understanding the relationships between the code and documents within some repositories. Or maybe this can be addressed through better documentation.

iangow commented 4 years ago

The code should generally be committed with a reference to an associated issue. I think for most "projects" it makes sense to have either a separate folder within the repository (see the edgar repository, though we have some orphan code in the root directory) or a separate repository (e.g., for directEDGAR considered as a project).

Sure. Go ahead and download the data if it isn't too much trouble. But I think you should check that the data can be stored in a useful way on the server. Also, I think you should document all those manual steps so someone else can do it in six months' time or whenever.