Closed iangow closed 3 years ago
@iangow So it seems there are two problems here: filing_docs/scrape_filing_docs.R has failed, and forms345/update_forms_345_tables.sh has failed. Let's look at each in isolation.
@iangow With regard to filing_docs/scrape_filing_docs.R, this seems to be the offending piece of code inside the function filing_docs_df:
df <-
    file_tables %>%
    html_table_mod() %>%
    bind_rows() %>%
    fix_names() %>%
    mutate(file_name = file_name,
           type = as.character(type),
           description = as.character(description)) %>%
    separate(col = document,
             into = c("document", "document_note"),
             sep = "[:space:]+")
In particular, it seems to just be the last part of the command
    separate(col = document,
             into = c("document", "document_note"),
             sep = "[:space:]+")
that's failing. I'm getting the error message
> filing_docs_df('edgar/data/1406815/0000899243-20-026684.txt')
Error in gregexpr(pattern, x, perl = TRUE) :
invalid regular expression '[:space:]+'
In addition: Warning message:
In gregexpr(pattern, x, perl = TRUE) :
So I replaced [:space:] with [\\s], and the code worked, though with some warnings:
> file_tables %>%
+ html_table_mod() %>%
+ bind_rows() %>%
+ fix_names() %>%
+ mutate(file_name = file_name,
+ type = as.character(type),
+ description = as.character(description)) %>%
+ separate(col = document,
+ into = c("document", "document_note"),
+ sep = "[\\s]+")
seq description document document_note type size
1 1 10-Q ns1q2010-q.htm iXBRL 10-Q 2313661
2 2 EXHIBIT 10.03 ns1q2010-qex1003.htm <NA> EX-10.03 115323
3 3 EXHIBIT 31.01 ns1q2010-qex3101.htm <NA> EX-31.01 8322
4 4 EXHIBIT 31.02 ns1q2010-qex3102.htm <NA> EX-31.02 8330
5 5 EXHIBIT 32.01 ns1q2010-qex3201.htm <NA> EX-32.01 5360
6 6 EXHIBIT 32.02 ns1q2010-qex3202.htm <NA> EX-32.02 5384
7 12 nslogoa04.jpg <NA> GRAPHIC 102220
8 NA Complete submission text file 0001110805-20-000051.txt <NA> 10923484
9 7 XBRL TAXONOMY EXTENSION SCHEMA DOCUMENT ns-20200331.xsd <NA> EX-101.SCH 49941
10 8 XBRL TAXONOMY EXTENSION CALCULATION LINKBASE DOCUMENT ns-20200331_cal.xml <NA> EX-101.CAL 102954
11 9 XBRL TAXONOMY EXTENSION DEFINITION LINKBASE DOCUMENT ns-20200331_def.xml <NA> EX-101.DEF 365441
12 10 XBRL TAXONOMY EXTENSION LABEL LINKBASE DOCUMENT ns-20200331_lab.xml <NA> EX-101.LAB 630168
13 11 XBRL TAXONOMY EXTENSION PRESENTATION LINKBASE DOCUMENT ns-20200331_pre.xml <NA> EX-101.PRE 440711
14 30 EXTRACTED XBRL INSTANCE DOCUMENT ns1q2010-q_htm.xml <NA> XML 2514052
file_name
1 edgar/data/1110805/0001110805-20-000051.txt
2 edgar/data/1110805/0001110805-20-000051.txt
3 edgar/data/1110805/0001110805-20-000051.txt
4 edgar/data/1110805/0001110805-20-000051.txt
5 edgar/data/1110805/0001110805-20-000051.txt
6 edgar/data/1110805/0001110805-20-000051.txt
7 edgar/data/1110805/0001110805-20-000051.txt
8 edgar/data/1110805/0001110805-20-000051.txt
9 edgar/data/1110805/0001110805-20-000051.txt
10 edgar/data/1110805/0001110805-20-000051.txt
11 edgar/data/1110805/0001110805-20-000051.txt
12 edgar/data/1110805/0001110805-20-000051.txt
13 edgar/data/1110805/0001110805-20-000051.txt
14 edgar/data/1110805/0001110805-20-000051.txt
Warning message:
Expected 2 pieces. Missing pieces filled with `NA` in 13 rows [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14].
@iangow Just started running the program
(base) bdcallen@igow-z640:~/edgar$ filing_docs/scrape_filing_docs.R
Loading required package: xml2
Processing batch 1
Writing data ...
86.19162 seconds
Seems to be working now. I think what happened was that filing_docs_df returned a bunch of NAs in this snippet here
temp <- mclapply(file_names$file_name, filing_docs_df, mc.cores = 8)
if (length(temp) > 0) {
    df <- bind_rows(temp)
    if (nrow(df) > 0) {
        cat("Writing data ...\n")
        dbWriteTable(pg, "filing_docs",
                     df, append = TRUE, row.names = FALSE)
    } else {
        cat("No data ...\n")
    }
}
leading to the failure in bind_rows, as the list in bind_rows needs to be a list of actual dataframes. I'm going to leave it running till it finishes.
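One possible guard against this, offered only as a sketch and not as what the script actually does, is to drop every element that is not a data frame before calling bind_rows. The list `results` below is made-up stand-in data for what mclapply might return when some calls fail; the column names are illustrative.

```r
library(dplyr)

# Hypothetical stand-in for the list returned by mclapply():
# two successful results and one failed element (an NA here).
results <- list(
  data.frame(file_name = "a.txt", seq = 1L, stringsAsFactors = FALSE),
  NA,
  data.frame(file_name = "b.txt", seq = 2L, stringsAsFactors = FALSE)
)

# Keep only the elements that are real data frames.
ok <- Filter(is.data.frame, results)

# bind_rows now only ever sees data frames.
df <- bind_rows(ok)

if (nrow(df) > 0) {
  cat("Writing data ...\n")
} else {
  cat("No data ...\n")
}
```

Filtering hides the failures, so in practice it may be worth logging which inputs were dropped before discarding them.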
I am rather curious why [:space:] does not seem to work anymore. Has there been some change to regular expressions in R since we wrote this code?
So I replaced [:space:] with [\\s], and the code worked, though with some warnings
I think it's better to use the fix in the commit above. [:space:] is equivalent to \\s, so one needs [[:space:]] to get the equivalent to [\\s]. I'm not sure why I made the switch to [:space:], but perhaps better to use [[:space:]] in case there was a good reason.
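A small check of that equivalence, as a sketch using base R's strsplit with perl = TRUE (the error message above suggests separate is calling a PCRE-based function, but that is an inference from the traceback). The input string is just one of the document values from the output above with extra spaces added.

```r
# Sample value: a document name followed by a note, separated by whitespace.
x <- "ns1q2010-q.htm   iXBRL"

# POSIX class written inside a bracket expression.
a <- strsplit(x, "[[:space:]]+", perl = TRUE)[[1]]

# Perl-style shorthand for whitespace.
b <- strsplit(x, "\\s+", perl = TRUE)[[1]]

print(a)
print(identical(a, b))
```

Both patterns split the string into the same two pieces, so either form should be a safe replacement for the bare [:space:]+.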
So I replaced [:space:] with [\\s], and the code worked, though with some warnings
Warning message:
Expected 2 pieces. Missing pieces filled with `NA` in 13 rows [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14].
The warning about NA values can be suppressed by an argument to separate. These should be innocuous, as rows without "document_note" will be common.
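For illustration, tidyr's separate has a fill argument, and fill = "right" pads missing pieces on the right with NA without warning. The sketch below uses a made-up two-row data frame rather than the real filing_docs data, and assumes the [[:space:]]+ separator discussed above.

```r
library(tidyr)

# Made-up sample: one row has a note after the document name, one does not.
docs <- data.frame(
  document = c("ns1q2010-q.htm iXBRL", "ns1q2010-qex1003.htm"),
  stringsAsFactors = FALSE
)

# fill = "right" fills the missing document_note with NA and
# does not emit the "Expected 2 pieces" warning.
out <- separate(docs,
                col = document,
                into = c("document", "document_note"),
                sep = "[[:space:]]+",
                fill = "right")

print(out)
```

The extra argument plays the analogous role for rows with too many pieces, should those ever appear.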
For example, in running ./update_edgar.sh, I see the messages below. This is running on my local server, but I suspect the same issues would appear if you ran the code on the MCCGR server.