teamtomo / starfile

STAR file I/O in Python
https://teamtomo.org/starfile/
BSD 3-Clause "New" or "Revised" License
44 stars 19 forks source link

string value columns are converted to number vlues #37

Closed logicvay2010 closed 8 months ago

logicvay2010 commented 1 year ago

Hi Alister,

I found the starfile program quite useful for my development! However, when read the "star" file, I found some columns supposed to be String values that can be converted to numbers in the Dataframe. In my example, the '01' for 'rlnTomoName' column is read as number 1 which could cause an issue when matching with the original tomogram. So could make an update on this? Thanks!

Best, Hui

alisterburt commented 1 year ago

Hey Hui, I’m glad you’re finding the package useful!

I’m letting pandas decide which columns are numeric by running pd.to_numeric and seeing if it errors - columns which raise an error are left as strings.

How would you like to see the proposed feature exposed to users? We could maybe add a list of column names to not auto-cast in starfile.read

This isn’t high priority for me but I would be happy to review a PR if you’re keen to see it added!

logicvay2010 commented 1 year ago

Hi Alister,

Thanks for the response! I am writing a GUI program for placing back sub-tomograms into the original tomograms, so I need to specify certain tomo names that appear in the star file to be generated. Sure, it's not an urgent thing. I just feel that it makes more sense to at least add some guaranteed not-numbered columns to be skipped during the read. An easy way to do that might be pre-scanning in _parse_loop_block(self) -> pd.DataFrame, if the column name is in a list of save words then just keep them as they were. ["rlnTomoName"].

It's just happened when I want just simplify my tomoName, I agree this scenario is not common. Technically, the name can be named after a number, and adding that hard-coded feature would make this program "safer" to use. Of course, it can be written in some constant-only file and be imported into that parser function so it can be extended afterward.

By the way, I am using imodmodel too, excellent work!

Thanks! Hui

EuanPyle commented 9 months ago

Hey Alister, I've actually just had a bug report from RELION repeating this issue (https://github.com/3dem/relion/issues/1079) where the user had named their TS only numbers (e.g. '0035') which starfile converts to '35' so during TSA, RELION can't find the correct rlnTomoName. Maybe it's worth re-visitng this issue? I imagine there will be lots of people new to tomography using the new v5 workflow so this mistake might become more common!

alisterburt commented 9 months ago

Yeah we can definitely implement the solution outlined above, not hitting named columns with pd.to_numeric -> I'm not sure when I'll get to this but would happily accept a PR if you want to spend the time!

EuanPyle commented 9 months ago

No probs, I'll shoot a PR when I get a second!

alisterburt commented 8 months ago

@logicvay2010 @EuanPyle's PR with this feature just got merged, available in starfile>=0.5.6

logicvay2010 commented 8 months ago

@logicvay2010 @EuanPyle's PR with this feature just got merged, available in starfile>=0.5.6

Awesome! Thanks, @EuanPyle !