imsweb / data-generator

This Java library allows to create synthetic (fabricated) NAACCR data files.
Other
6 stars 1 forks source link

Data Generator

Quality Gate Status integration Maven Central

This Java library can be used to create cancer-related synthetic data files.

Download

The library is available on Maven Central.

To include it to your Maven or Gradle project, use the group ID com.imsweb and the artifact ID data-generator.

You can check out the release page for a list of the releases and their changes.

This library requires Java 8 or a more recent version.

Usage

As of version 1.10, the GUI Standalone component of the library has been retired. The free File*Pro software can be used to generate synthetic data using a user-friendly interface.

When embedding the library in your project, the following generators are available:

NAACCR Fixed-columns Generator

The NAACCR generator current provides rules for the following fields:

Here is an example using the NAACCR generator:

// create the generator
NaaccrDataGenerator generator = new NaaccrDataGenerator(LayoutFactory.LAYOUT_ID_NAACCR_18_ABSTRACT);

// generate a single patient with 2 tumors
List<Map<String, String>> patient = generator.generatePatient(2);

// generate a file with 500 tumors, each patient will have a random number of tumors (mostly 1)
generator.generateFile(targetFile, 500)

The generator accepts an additional options object as an input to the generate methods, that object can be used to customize the random data generation of some of the fields.

NAACCR XML Generator

The NAACCR XML generator uses the same rules as the NAACCR fixed-columns generator.

Here is an example using the NAACCR XML generator:

// create the generator
NaaccrXmlDataGenerator generator = new NaaccrXmlDataGenerator(LayoutFactory.LAYOUT_ID_NAACCR_XML_18_ABSTRACT);

// generate a single patient with 2 tumors
Patient patient = generator.generatePatient(2);

// generate a file with 500 tumors, each patient will have a random number of tumors (mostly 1)
generator.generateFile(targetFile, 500)

The generator accepts an additional options object as an input to the generate methods, that object can be used to customize the random data generation of some of the fields.

NAACCR HL7 Generator

The NAACCR HL7 generator provides rules for the following segments:

Here is an example using the NAACCR HL7 generator:

// create the generator
NaaccrHl7DataGenerator generator = new NaaccrHl7DataGenerator(LayoutFactory.LAYOUT_ID_NAACCR_HL7_2_5_1);

// generate a single message
Hl7Message message = generator.generateMessage();

// generate a file with 10 messages
generator.generateFile(targetFile, 10)

The generator accepts an additional options object as an input to the generate methods, that object can be used to customize the random data generation of some of the fields.

Physician and Facility Data Generator

Those generator don't require a layout; they can be used generate physicians and facilities. The data is created from publicly available NPI data files.

Defining Variables

This library supports variables and file formats through the layout framework. A layout object must be used to initialize one of the data generator objects (although some of them supports providing just the layout ID).

Creating Random Data

The library uses rules to create data; each rule being responsible for assigning one or several fields. The NAACCR generators come with a set of basic rules to assign some of the NAACCR fields; the generic generator does't come with any rules.

The library uses three ways to assign values:

  1. Constant values: the rule always assigns the same value to the field.
  2. Random values from a list: the rule assigns a random value from a specific list of values.
  3. Random values based on a frequency: the rule uses a frequency (usually from a CSV file) to get the value to assign; this results in more common values being assigned more often.

In addition to those assignment mechanisms, each rule might have dependencies to the values assigned by previous rules.

The default NAACCR rules use frequencies extracted from the SEER data.

To know more about the default NAACCR rules, check out the rule package.

About SEER

This library was developed through the SEER program.

The Surveillance, Epidemiology and End Results program is a premier source for cancer statistics in the United States. The SEER program collects information on incidence, prevalence and survival from specific geographic areas representing a large portion of the US population and reports on all these data plus cancer mortality data for the entire country.