synthetichealth / synthea

Synthetic Patient Population Simulator
https://synthetichealth.github.io/synthea
Apache License 2.0

Malformed input or input contains unmappable characters #1419

Open dattachandan opened 8 months ago

dattachandan commented 8 months ago

What happened?

When using the exporter class (org.mitre.synthea.export.Exporter) and running the run_synthea app, I can see characters in XML that are causing exceptions. Has anyone seen this before, and can you recommend a fix?

Environment

- OS: Ubuntu 20.04
- Java: JDK 8 and 11

Relevant log output

java.nio.file.InvalidPathException: Malformed input or input contains unmappable characters: Raquel318_Henr?quez109_4b33f163-b07b-4ec7-b579-5ac453371d4f.xml
        at sun.nio.fs.UnixPath.encode(UnixPath.java:147)
        at sun.nio.fs.UnixPath.<init>(UnixPath.java:71)
        at sun.nio.fs.UnixFileSystem.getPath(UnixFileSystem.java:281)
        at sun.nio.fs.AbstractPath.resolve(AbstractPath.java:53)
        at org.mitre.synthea.export.Exporter.exportRecord(Exporter.java:164)
        at org.mitre.synthea.export.Exporter.export(Exporter.java:56)
        at org.healthlink.exporter.syntheainterface.PatientGenerator.generatePerson(PatientGenerator.java:389)
        at org.healthlink.exporter.syntheainterface.PatientGenerator.lambda$run$2(PatientGenerator.java:221)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

dehall commented 8 months ago

Yes, we've seen this before with certain file systems that don't support accented characters in filenames (#569). For posterity, do you happen to know the type of filesystem the files are being written to?

The quickest fix is to use UUID-only filenames instead of including patient names in them. To do that, set the exporter.use_uuid_filenames config setting to true, either in src/main/resources/synthea.properties or on the command line: ./run_synthea --exporter.use_uuid_filenames=true ...

Alternatively, if you're willing to make code changes, you can add some filename sanitization to the org.mitre.synthea.export.Exporter.filename method.
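
As a rough illustration of the sanitization idea, here is a minimal sketch. The FilenameSanitizer class and its sanitize method are hypothetical and not part of Synthea, and the exact place to call it inside Exporter.filename is not shown because that method's signature is not given in this thread. It strips accents via Unicode decomposition and replaces anything left outside a conservative ASCII set with an underscore:

```java
import java.text.Normalizer;

// Hypothetical helper for illustration only; not part of the Synthea codebase.
class FilenameSanitizer {

  // Returns a filename restricted to ASCII letters, digits, '.', '_' and '-'.
  static String sanitize(String filename) {
    // Decompose accented letters into base letter + combining mark (e.g. í -> i + U+0301).
    String decomposed = Normalizer.normalize(filename, Normalizer.Form.NFD);
    // Drop the combining marks so only the base letters remain.
    String stripped = decomposed.replaceAll("\\p{M}+", "");
    // Replace any character still outside the conservative set with an underscore.
    return stripped.replaceAll("[^A-Za-z0-9._-]", "_");
  }

  public static void main(String[] args) {
    // The accented name from the log above, with the í written as a Unicode escape.
    String original = "Raquel318_Henr\u00edquez109_4b33f163-b07b-4ec7-b579-5ac453371d4f.xml";
    System.out.println(sanitize(original));
  }
}
```

Because the UUID stays in the filename, two patients whose names collapse to the same ASCII form should still get distinct files under this approach.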

dattachandan commented 8 months ago

I was running it on macOS Monterey with Apple File System (APFS). It also happened on an ext4 Ubuntu volume.