mccgr / edgar

Code to manage data related to SEC EDGAR

Delete bad documents from filing_docs #95

Closed (iangow closed this issue 4 years ago)

iangow commented 4 years ago
crsp=# SELECT file_name, document FROM edgar.filing_docs WHERE document ~ '\s+' LIMIT 10;
                  file_name                  |              document               
---------------------------------------------+-------------------------------------
 edgar/data/1314223/0001314223-17-000009.txt | ambr-1231201610xk.htm   iXBRL
 edgar/data/1314223/0001314223-17-000011.txt | ambr-1231201610xka.htm   iXBRL
 edgar/data/1356564/0001445866-17-000475.txt | leom-20161231.htm   iXBRL
 edgar/data/1360442/0001445866-17-000585.txt | cbds-20161231.htm   iXBRL
 edgar/data/1373715/0001373715-17-000026.txt | now-20161231x10k.htm   iXBRL
 edgar/data/1411168/0001376474-17-000015.txt | dug-20161231.htm   iXBRL
 edgar/data/1414628/0001414628-16-000045.txt | clpi201610-k.htm   iXBRL
 edgar/data/1414628/0001628280-17-006916.txt | momt201710-k.htm   iXBRL
 edgar/data/1415404/0001415404-17-000010.txt | sats_123116x10kdocument.htm   iXBRL
 edgar/data/1423325/0001445866-17-000448.txt | icnn-20161231.htm   iXBRL
(10 rows)

Once these rows are deleted, re-running with the revised code should re-populate the table with correct values. It may be necessary to delete every row for the affected file_name values, not just the bad rows, to trigger reprocessing.

iangow commented 4 years ago

I did this on my server here:

igow@igow-ubuntu-mate:~/git/edgar$ psql
psql (12.3 (Ubuntu 12.3-1.pgdg20.04+1))
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
Type "help" for help.

crsp=# DELETE FROM edgar.filing_docs WHERE document ~ '\s+';
DELETE 41185
crsp=# \q

iangow commented 4 years ago

Oops. This is the query I should've run:

igow@igow-z640:~/git/wrds_pg$ psql
psql (12.3 (Ubuntu 12.3-1.pgdg18.04+1), server 11.8 (Ubuntu 11.8-1.pgdg18.04+1))
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
Type "help" for help.

crsp=# DELETE FROM edgar.filing_docs WHERE file_name IN (
    SELECT file_name FROM edgar.filing_docs WHERE document ~ '\s+');
DELETE 573983
crsp=#
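
The gap between the two row counts (41185 vs. 573983) comes from the scope of each DELETE: the first removes only the rows whose document contains whitespace, while the second removes every row belonging to a filing that has at least one such row. A minimal sketch of that difference follows, using an in-memory SQLite table as a stand-in for edgar.filing_docs. The `REGEXP` function, the helper names, and the exhibit rows are assumptions made for illustration; only the first sample row is taken from the output above.

```python
import re
import sqlite3


def regexp(pattern, value):
    # SQLite evaluates "X REGEXP Y" as regexp(Y, X), so the pattern comes first.
    return int(value is not None and re.search(pattern, value) is not None)


def make_db():
    """Build a toy stand-in for edgar.filing_docs (extra rows are hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("REGEXP", 2, regexp)
    conn.execute("CREATE TABLE filing_docs (file_name TEXT, document TEXT)")
    rows = [
        # Bad row, copied from the query output in this issue.
        ("edgar/data/1314223/0001314223-17-000009.txt",
         "ambr-1231201610xk.htm   iXBRL"),
        # Hypothetical clean rows from the same filing.
        ("edgar/data/1314223/0001314223-17-000009.txt", "exhibit-a.htm"),
        ("edgar/data/1314223/0001314223-17-000009.txt", "exhibit-b.htm"),
        # Hypothetical filing with no bad rows.
        ("edgar/data/0000000/hypothetical-filing.txt", "clean.htm"),
    ]
    conn.executemany("INSERT INTO filing_docs VALUES (?, ?)", rows)
    return conn


def delete_bad_rows(conn):
    """Analogue of the first query: delete only rows with whitespace."""
    cur = conn.execute(
        "DELETE FROM filing_docs WHERE document REGEXP ?", (r"\s+",)
    )
    return cur.rowcount


def delete_bad_filings(conn):
    """Analogue of the second query: delete all rows of affected filings."""
    cur = conn.execute(
        "DELETE FROM filing_docs WHERE file_name IN ("
        "SELECT file_name FROM filing_docs WHERE document REGEXP ?)",
        (r"\s+",),
    )
    return cur.rowcount


if __name__ == "__main__":
    print("rows deleted, bad rows only:", delete_bad_rows(make_db()))
    print("rows deleted, whole filings:", delete_bad_filings(make_db()))
```

If the processing code decides what to do by looking for file_name values that are missing from the table (an assumption, not confirmed in this thread), then only the second form leaves an affected filing fully absent and therefore eligible for reprocessing.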