jackwasey / icd

Fast ICD-10 and ICD-9 comorbidities, decoding and validation in R. NB use main instead of master for default branch.
https://jackwasey.github.io/icd/
GNU General Public License v3.0

Consider using C pointers to speed up interpretation of string IDs #167

Open jackwasey opened 5 years ago

jackwasey commented 5 years ago

Strings are slow. R keeps a global cache of strings (the CHARSXP cache), a bit like the levels of a factor, so each unique string has only one memory address. We can exploit this to speed up string processing (assuming the same encoding for all strings!).

switch (TYPEOF(id)) {
  // factor or integer IDs
  case INTSXP: {
    idi = as<IntegerVector>(id);
    break;
  }
  case STRSXP: {
    idi = IntegerVector(len);
    // get the global char cache pointers
    // https://cran.r-project.org/doc/manuals/r-release/R-ints.html#The-CHARSXP-cache
    for (R_xlen_t i = 0; i != len; ++i) {
      const char * cstmp = CHAR(STRING_ELT(id, i));
      // Use the memory pointer itself as the ID. The global cache may change,
      // but not during this thread's execution?
      // std::uintptr_t (from <cstdint>) is pointer-sized on every platform;
      // unsigned long is only 32 bits on 64-bit Windows.
      std::uintptr_t pnt = reinterpret_cast<std::uintptr_t>(cstmp);
      // IntegerVector elements are 32-bit ints, so this narrows a 64-bit
      // pointer and distinct strings could collide.
      idi(i) = static_cast<int>(pnt);
    }
    break;
  } // case STRSXP
  case REALSXP: {
    idi = floor((NumericVector)id * 1e6);
    break;
  }
  default:
    stop("ID vector should be numeric, factor or character.");
} // end switch
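
To make the pointer-as-ID idea concrete outside of R, here is a minimal standalone sketch. It does not use R or Rcpp: `StringPool` and `pointer_keys` are hypothetical names, and the `std::unordered_set` is only a stand-in for R's global string cache. It keeps the full pointer width in a `std::uintptr_t` rather than narrowing to a 32-bit integer.

```cpp
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for R's global CHARSXP cache: each distinct string
// is stored exactly once, so equal strings share one address.
class StringPool {
public:
  // Return a pointer to the single stored copy of s, inserting it if needed.
  const std::string* intern(const std::string& s) {
    return &*pool_.insert(s).first;
  }

private:
  // Rehashing an unordered_set does not move its elements, so the
  // pointers handed out by intern() stay valid as the pool grows.
  std::unordered_set<std::string> pool_;
};

// Map each ID string to an integer key derived from its interned address.
// Equal strings get equal keys; comparing keys never touches the characters.
std::vector<std::uintptr_t> pointer_keys(StringPool& pool,
                                         const std::vector<std::string>& ids) {
  std::vector<std::uintptr_t> keys;
  keys.reserve(ids.size());
  for (const auto& id : ids) {
    keys.push_back(reinterpret_cast<std::uintptr_t>(pool.intern(id)));
  }
  return keys;
}
```

Calling `pointer_keys` on `{"p1", "p2", "p1"}` should give three keys where the first and third are equal and the second differs. The keys are only meaningful while the pool is alive and should not be stored or compared across sessions.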
jackwasey commented 5 years ago

R doesn't let us mark a string as UTF-8 if it contains only ASCII characters. Since all ICD codes are ASCII, we might assume that each code has exactly one entry in the global char cache, and thus we can use the memory pointer as an identifier for all ICD codes.
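
The encoding argument above can be modelled with a toy cache keyed on both the bytes and an encoding mark. This is only a sketch of the behaviour as I understand it, not R's implementation: `Encoding`, `is_ascii` and `EncodedPool` are hypothetical names, and the rule "ASCII-only strings lose their encoding mark" is the assumption being illustrated.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical encoding marks, loosely modelled on R's native/UTF-8/Latin-1.
enum class Encoding { Native, Utf8, Latin1 };

// True when every byte is in the 7-bit ASCII range.
bool is_ascii(const std::string& s) {
  for (unsigned char c : s) {
    if (c > 0x7F) return false;
  }
  return true;
}

// Toy cache with one entry per (bytes, encoding) pair.
class EncodedPool {
public:
  const std::string* intern(const std::string& bytes, Encoding enc) {
    // Assumed rule: an ASCII-only string is never marked with an encoding,
    // so every request for it maps to the same cache entry.
    if (is_ascii(bytes)) enc = Encoding::Native;
    auto it = pool_.emplace(std::make_pair(bytes, enc), bytes).first;
    return &it->second;  // std::map nodes are stable, so this stays valid
  }

private:
  std::map<std::pair<std::string, Encoding>, std::string> pool_;
};
```

In this model, interning `"E11.9"` as UTF-8 and again as Latin-1 returns the same pointer, whereas a non-ASCII string interned under two different marks gets two distinct entries. That is why the pointer shortcut looks safe for ICD codes but not for arbitrary strings.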