Open fedorov opened 11 months ago
Updating the .csv file should be no problem, but the color columns would not be used, as they are specified in the .json file.
Do you suggest that we should drop the json file format for terminology specification? It would make sense, as we see that most users are incapable of putting together a valid terminology json file, while they are comfortable with managing the same kind of information in a csv file. However, for this I would not want to develop and maintain a custom converter but some common converter tool would be needed (e.g., a small Python package developed and maintained by IDC).
A note on the color specification in the current totalsegmentator_snomed_mapping_with_partial_colors.csv file. It is quite odd that the RGB color components are dumped into a single column. You already need to do standard CSV parsing to get values, but to get the color you would need to do an additional custom string parsing. Why don't we store the value in 3 columns, such as recommendedDisplayRGBValue.R
, recommendedDisplayRGBValue.G
, recommendedDisplayRGBValue.B
?
Do you suggest that we should drop the json file format for terminology specification?
Not sure. It's just that right now we have 2 problems:
However, for this I would not want to develop and maintain a custom converter but some common converter tool would be needed (e.g., a small Python package developed and maintained by IDC).
You mean converter from JSON to CSV?
Maybe this can be done as part of a python wrapper for dcmqi (something simple similar to https://github.com/AIM-Harvard/pyplastimatch). I would not want to create and maintain a package just for this conversion.
Why don't we store the value in 3 columns, such as recommendedDisplayRGBValue.R, recommendedDisplayRGBValue.G, recommendedDisplayRGBValue.B?
I agree, this makes sense. Thank you for the suggestion!
there is duplication of information between CSV and JSON
Without the colors there is no duplication. The CSV is just a mapping from structure name to terminology codes. All metadata is in the JSON file.
Color would change all this, as it would mean we directly store metadata in the CSV. If we do that then why would we need the JSON file at all? Storing some metadata in the CSV while the same and some more in the JSON file would introduce redundancy and uncertainty (should I use the color recommended in the CSV or what I find by looking up the item in the terminology file)?
If we start generating terminology CSV files with metadata then we acknowledge that the JSON representation is not suitable for all use cases. We should then check if perhaps a CSV representation could be suitable for all use cases, and if the answer is yes then we should retire the JSON representation. If we think that neither JSON nor CSV can fulfill all the needs of all use cases then we need a converter that allows lossless transformation between the two representations (otherwise we would need to manually synchronize changes).
You mean converter from JSON to CSV?
If we want to support both file formats then we need transformation in both directions.
Maybe this can be done as part of a python wrapper for dcmqi (something simple similar to https://github.com/AIM-Harvard/pyplastimatch). I would not want to create and maintain a package just for this conversion
Adding it to any small, lightweight native Python package (that does not need to be compiled and only depends on a few widely used, well-maintained packages) could work well. Creating such a package is very easy and its maintenance is barely any commitment (you just enable some default github actions to run tests and create and upload packages on PyPI).
The commitment for maintaining compiled C++ packages is so much higher (we need to keep updating the build system for new Python versions and other dependencies on all operating systems, build and upload the wheels, etc.) that I would only consider putting these small converter scripts into such a complex/large/compiled package only if that package is so essential for many projects that we know for sure that there will be always people to continue to support it - even if you don't have funding or interest in maintaining it anymore. I don't think pyplastimatch has reached this level. It would be better to add it to a more essential Python package, or add it a small new native Python package.
selection of colors requires at least some effort and is probably of interest to other TotalSegmentator users
This is an interesting question. Color is really application-dependent. In many cases you want all bones to appear with the same or barely different color, in many other cases you would prefer displaying cervical/thoracic/lumbar with slightly different colors, while if you are interested in the spine itself then you may want to assign different color to each level. There is no single color for a structure that would work well for everyone.
One possibility is to try to make a compromise - e.g., make the vertebrae colored similarly but not exactly the same - but this means that this will only work when users can accept suboptimal results. If someone does not want to accept compromise and wants to avoid redundancy then color mapping should be stored in a separate table.
There are other metadata fields, which should be added separately, such as common name for a structure, which may be depend on the clinical application and of course also on the country/language. There is even some ambiguity in the code values - the same term may be encoded in different ways even in the same terminology, and of there are also different terminologies. These could be all handled by having additional tables that would allow translation between different codes (between terminologies or within a terminology) and mapping them to metadata (common name, preferred color, etc. depending on clinical application, country, language, etc.).
Overall, having several simpler, independent tables that can be independently maintained by different people could be better than large, complex json files. Independent maintenance is important because nobody has all the necessary knowledge to create and maintain these tables: anatomists would know correspondence of codes between terminologies, maybe some names in some languages; clinicians would be able to tell what names and colors they would like to use in their clinical practice; maintainers of terminologies would know how to create translation tables to take care of changes in their codes; etc.
Thank you for the interesting observations. Nothing is simple!
Without the colors there is no duplication.
I think there is duplication. E.g., see these two - mapping of the codes is present in both JSON and CSV.
and
But I agree, this will need more thought. Right now I don't want to distract from the current main topic we are working on, but we should revisit this in the future!
JSON defines metadata for each terminology code. CSV provides a translation table from TotalSegmentator string ID to terminology code. So, I don't see any duplication. Of course the terminology code that the translation table defines must be present in the JSON file but that's not duplication, that's just the key that is used for looking up the metadata.
If we really want to find some redundancy between the current CSV and JSON then we could say that the optional "code meaning" part of the terminology code is redundant. Indeed, code meaning could be removed from the table, as it is already specified in the JSON and it is an optional part of the terminology code anyway.
Perhaps it makes more clear why color and other additional metadata does not belong to the terminology mapping table is if you think about how TA2 terminology code support would be implemented in TotalSegmentator. The obvious method would be to add another CSV file that maps from TotalSegmentator structure name to TA2 codes. If the table contains metadata, such as colors, international names, etc. then you would need to maintain all that duplicated metadata for all coding schemes. If metadata was stored in a single place, in the terminology JSON file, then there would be no redundant metadata storage and no maintenance headaches.
In term of next steps:
The best would be if TotalSegmentator provided terminology information via a function call. The function could take structure name (internal TotalSegmentator ID) as input and it would return the SNOMED code. When mapping to TA2 terminology is added then the function could take an optional "coding scheme designator" argument.
If this was implemented in TotalSegmentator then we would not need the CSV file in the Slicer extension at all. We had to add the CSV file to the Slicer extension because last time I checked the SNOMED mapping CSV was not even included in the TotalSegmentator Python package.
It would be nice to have a set of scripts that can consume a set of terminology translation table files and metadata specification files (that specifies some metadata for a code) and would provide functions like:
These terminology translation table CSV files could be standardized, they could be just a list of corresponding codes in different coding schemes. We could define a coding scheme designator for TotalSegmentator (e.g., 99TOTALSEG
) and use the structure names as codes. Similarly, every project could define its own local coding scheme and provide a translation table.
For terminology metadata files we could try to keep using our current JSON files, just not to invent anything new. But perhaps it could be better to simplify and just use CSV files for metadata, too (just a simple table with columns that contain the code and metadata fields).
Currently colors are stored in the JSON file, which also contains terminology codes. We merged them into the terminology, as you can see in https://github.com/ImagingDataCommons/Cloud-Resources-Workflows/blob/main/configs/TotalSegmentator/totalsegmentator_snomed_mapping_with_partial_colors.csv. If you like @lassoan, we can make a PR to update the CSV in this repo as well. We will probably make that PR to the upstream TotalSegmentator repo, since colors are not available there at all, and if that one is accepted, we can just use the mapping file for both terminology and colors from the upstream repo instead of maintaining it separately.
cc: @vkt1414