sigven / pcgr

Personal Cancer Genome Reporter (PCGR)
https://sigven.github.io/pcgr
MIT License

Extend sample ID length limits #224

Closed: MareikeJaniak closed this issue 2 months ago

MareikeJaniak commented 4 months ago

Good afternoon!

We are generating PCGR reports as part of our pipeline and have occasionally run into an issue when sample IDs are longer than 35 characters:

2024-04-16 02:00:02 - pcgr-validate-arguments-input - INFO - PCGR - STEP 0: Validate input data and options
2024-04-16 02:00:02 - pcgr-validate-arguments-input - ERROR - 
2024-04-16 02:00:02 - pcgr-validate-arguments-input - ERROR - Sample name identifier ('--sample_id' = ASDF-VB-23-34-000000452384B2-15255XY) must be between 2 and 35 characters long
2024-04-16 02:00:02 - pcgr-validate-arguments-input - ERROR - 

The sample IDs are outside of our control and can't be changed due to the sample tracking system that we have in place. We have come up with a work-around that shortens sample IDs longer than 35 characters just for the purposes of running PCGR and then renames the output files back to the actual sample ID, so they can be tracked within our system.
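A minimal Python sketch of the work-around described above, for illustration only. The helper names `shorten_sample_id` and `restore_output_names` are hypothetical, and the sketch assumes that PCGR output files are named `<sample_id>.<rest>` (for example `<sample_id>.pcgr.grch37.html`); it is not the pipeline code referred to here.

```python
from pathlib import Path

# Upper limit taken from the PCGR error message quoted above.
MAX_SAMPLE_ID_LEN = 35


def shorten_sample_id(sample_id: str, max_len: int = MAX_SAMPLE_ID_LEN) -> str:
    """Return sample_id unchanged if it fits, else its first max_len characters."""
    return sample_id if len(sample_id) <= max_len else sample_id[:max_len]


def restore_output_names(output_dir: Path, short_id: str, full_id: str) -> list[Path]:
    """Rename files named '<short_id>.<rest>' to '<full_id>.<rest>'.

    Assumes (hypothetically) that every output file starts with the
    sample ID followed by a dot. Returns the new paths.
    """
    renamed: list[Path] = []
    if short_id == full_id:
        return renamed  # nothing was truncated, nothing to rename
    prefix = short_id + "."
    # sorted() materializes the listing before any file is renamed
    for path in sorted(output_dir.iterdir()):
        if path.is_file() and path.name.startswith(prefix):
            target = path.with_name(full_id + path.name[len(short_id):])
            if target.exists():
                raise FileExistsError(f"refusing to overwrite {target}")
            path.rename(target)
            renamed.append(target)
    return renamed
```

The shortened ID would be passed to `--sample_id`, and `restore_output_names` would be called on the output directory once the run has finished. Plain truncation can map two different sample IDs to the same shortened ID if they share their first 35 characters, so a real pipeline would need to check for such collisions.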

For future releases, we were just wondering if it would be possible to increase the length limit, or perhaps set the sample ID for the report title separately from the output file prefix?

Thanks!

Best, Mareike

sigven commented 4 months ago

Hi Mareike,

Thanks for reaching out. I truly understand your need, and we will experiment a bit to see how such long sample names could fit in the new version we are working on. The length limit was set primarily for visual reasons. I'll get back to you shortly with some examples of how it may look in the new version, ok? Out of curiosity, do you happen to know the maximum character length of your sample identifiers? I noticed the one above is 36.

kind regards, Sigve

MareikeJaniak commented 4 months ago

Hi Sigve,

Thanks for your quick response!

I totally understand that there are visual concerns with having very long sample names. The sample name displayed in the output itself isn't a big concern for us and we are okay with truncating it for those purposes, but we have to keep the full sample name in the output file name, for tracking purposes. Maybe a solution could be an option that allows setting the sample ID displayed in the output and the output file prefix separately?

So far, all of the problematic sample names have been just 1-2 characters over the limit of 35. We have also communicated to the project manager that such long names aren't ideal, but because of the size of the project, some of it is outside of our control.

Like I said, we have found a work-around in our pipeline for now, by truncating the sample ID for PCGR and then renaming the output files, so this isn't an urgent issue, by any means! But I appreciate that you're considering it!

Best, Mareike

sigven commented 4 months ago

Here is a glimpse of how it will look in the upcoming version (for a dummy sample); it seems to work ok.

https://www.dropbox.com/scl/fi/p22jglyyzwnbaqjdj5762/ASDF-VB-23-34-000000452384B2-15255XY.pcgr.grch37.html?rlkey=d3w9erg7mqm3fkxyf1i7bd50g&dl=0

best, Sigve

sigven commented 2 months ago

Fixed; this is addressed in the upcoming release.

MareikeJaniak commented 2 months ago

Thank you. Much appreciated!