ipums / ipumsr

Request, download, and read IPUMS data in R
https://tech.popdata.org/ipumsr/
Mozilla Public License 2.0

understanding parsing of DDI file using regular expression #73

Closed 00krishna closed 5 months ago

00krishna commented 5 months ago

Hello. I was working on parsing a DDI file and looking through the ipumsr source code. One part I found confusing is in ddi_read.R, which appears to parse the <codInstr> section of a variable node.

Most of the time, the categorical information is contained in the <catgry> tags. However, I noticed that this code uses regular expressions to extract labels from the text of the <codInstr> tag. Why is it necessary to parse the <codInstr> section of the DDI file, and how common is it for labels to appear there? The regular expressions are very specific, so I am not sure they would generalize well. Is this function used only for the "total personal income" variable INCTOT, or do other variables also have categorical information in the <codInstr> tag?

The code from ipumsr is in ddi_read.R, starting at line 907:

```r
parse_code_regex <- function(x, vtype) {
  if (vtype %in% c("numeric", "integer")) {
    labels <- fostr_named_capture(
      x,
      "^(?<val>-?[0-9.,]+)(([[:blank:]][[:punct:]]|[[:punct:]][[:blank:]]|[[:blank:]]|=)+)(?<lbl>.+?)$",
      only_matches = TRUE
    )

    labels$val <- as.numeric(fostr_replace_all(labels$val, ",", ""))
  } else {
    labels <- fostr_named_capture(
      x,
      "^(?<val>[[:graph:]]+)(([[:blank:]]+[[:punct:]|=]+[[:blank:]])+)(?<lbl>.+)$",
      only_matches = TRUE
    )
  }

  labels
}
```

robe2037 commented 5 months ago

In general, you shouldn’t need to parse IPUMS DDI files manually, so I’m not totally clear on what the use case is for doing so.

That being said, most labeled values do come from <catgry> tags in the DDI. However, there are occasional cases where labels are found in the <codInstr> tags, primarily for continuous variables that also include a few labeled values (for instance, ceiling values on income, not-in-universe codes, etc.).
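To make this concrete, here is a rough base-R sketch of what the numeric branch of parse_code_regex() does with <codInstr>-style text. The pattern is copied from the code quoted above. Everything else is an assumption for illustration: the helper name parse_numeric_codes() is invented, the sample lines are made up rather than taken from a real DDI, and regexpr() stands in for the internal fostr_named_capture() helper, so this approximates the package's behavior rather than reproducing it.

```r
# Numeric-branch pattern, copied from parse_code_regex() as quoted in this issue.
pattern <- paste0(
  "^(?<val>-?[0-9.,]+)",
  "(([[:blank:]][[:punct:]]|[[:punct:]][[:blank:]]|[[:blank:]]|=)+)",
  "(?<lbl>.+?)$"
)

# Hypothetical helper (not part of ipumsr): keep only the lines that match
# and return their value/label pairs.
parse_numeric_codes <- function(x) {
  m <- regexpr(pattern, x, perl = TRUE)
  hit <- which(as.vector(m) != -1)        # indices of lines that matched
  group_names <- attr(m, "capture.names") # "val", "", "", "lbl"
  starts <- attr(m, "capture.start")
  lens <- attr(m, "capture.length")

  # Extract one named capture group from every matching line.
  grab <- function(name) {
    j <- match(name, group_names)
    s <- starts[hit, j]
    substring(x[hit], s, s + lens[hit, j] - 1)
  }

  data.frame(
    val = as.numeric(gsub(",", "", grab("val"), fixed = TRUE)),
    lbl = grab("lbl"),
    stringsAsFactors = FALSE
  )
}

# Made-up lines resembling the text of a <codInstr> tag.
example_lines <- c(
  "INCTOT reports total personal income.",
  "9999999 = N.I.U.",
  "9,999,998 = Missing",
  "-1: Not reported"
)

labels <- parse_numeric_codes(example_lines)
print(labels)
```

The prose line is dropped because it does not start with a number, and the thousands separators are stripped before the value is converted to numeric.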

Parsing the <codInstr> tags is somewhat finicky because the text in these tags is often less structured than in other DDI tags. The code you mention is designed for a very specific use case, so I don’t think updating it is a priority unless there are cases where it does not correctly capture labeled values. If that’s the case, please open a separate issue identifying the variables whose labels are not parsed correctly. You’re also welcome to submit a PR with proposed updates to the code.
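If you want to check whether labels are being missed before opening an issue, something like the following may help. It is only a sketch: find_unparsed_lines() is a hypothetical helper, the sample lines are made up, and "contains a digit" is an assumed heuristic for "might be a labeled value", not something ipumsr itself does.

```r
# Numeric-branch pattern, copied from parse_code_regex() as quoted in this issue.
pattern <- paste0(
  "^(?<val>-?[0-9.,]+)",
  "(([[:blank:]][[:punct:]]|[[:punct:]][[:blank:]]|[[:blank:]]|=)+)",
  "(?<lbl>.+?)$"
)

# Hypothetical helper: return lines that contain a digit (so they might
# carry a code) but that the numeric pattern does not match.
find_unparsed_lines <- function(x) {
  has_digit <- grepl("[0-9]", x)
  matched <- grepl(pattern, x, perl = TRUE)
  x[has_digit & !matched]
}

# Made-up lines resembling the text of a <codInstr> tag.
sample_lines <- c(
  "9999999 = N.I.U.",                       # value, separator, label: matches
  "9999999:Missing",                        # no blank around the separator
  "N.I.U. = 9999998",                       # label comes before the value
  "Reports total pre-tax personal income."  # plain prose, no digits
)

unparsed <- find_unparsed_lines(sample_lines)
print(unparsed)
```

With these sample lines, the second and third entries are flagged. Lines like these, along with the variable they come from, would be useful to include in a new issue.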