sisimai / go-sisimai

Go language version of Sisimai: under development and may be released this summer
BSD 2-Clause "Simplified" License

Use common data between Perl/Ruby/Go versions (eg. JSON)? #13

Open bohwaz opened 2 weeks ago

bohwaz commented 2 weeks ago

Hi,

I just saw this new Go version thanks to your reply. I was interested in reusing some of the data from Sisimai in my PHP code, and I'm wondering why you didn't go the route of having a common set of data that could be reused between the various language versions of Sisimai?

For example you could have a set of JSON files that list the various strings and pairs for matching reasons, like I extracted here: https://gist.github.com/bohwaz/9c5b8354089a15033ea1a97a267cabfb#file-reasons-json

The same probably applies to other parts, like matching Rhost errors, which are mostly just large arrays. Some parts, like Lhost, might be too complex to make "generic". But having the same common data for each library would avoid having to replicate string changes from one library to the other, reducing duplicated effort.

For example you would have a "sisimai-data" repo that would be pulled by various libraries to have up to date data for matching reasons, Rhost, etc.
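To make the idea concrete, here is a minimal Go sketch of matching against shared JSON data. The JSON layout (reason name mapped to a list of lowercase substrings), the sample strings, and the helper names `loadReasons` and `matchReason` are all assumptions for illustration; they are not taken from Sisimai or from the gist above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// sharedJSON stands in for a file from a hypothetical "sisimai-data" repo.
// The strings below are made up for illustration, not copied from Sisimai.
const sharedJSON = `{
  "mailboxfull": ["mailbox full", "over quota"],
  "userunknown": ["user unknown", "no such user"]
}`

// loadReasons decodes the shared data into a map of reason -> substrings.
func loadReasons(data []byte) (map[string][]string, error) {
	reasons := map[string][]string{}
	if err := json.Unmarshal(data, &reasons); err != nil {
		return nil, err
	}
	return reasons, nil
}

// matchReason returns the first reason (in sorted name order) that has a
// substring contained in message, or "" when nothing matches.
func matchReason(reasons map[string][]string, message string) string {
	names := make([]string, 0, len(reasons))
	for name := range reasons {
		names = append(names, name)
	}
	sort.Strings(names) // map iteration order is random; sort for stable results

	lower := strings.ToLower(message)
	for _, name := range names {
		for _, s := range reasons[name] {
			if strings.Contains(lower, s) {
				return name
			}
		}
	}
	return ""
}

func main() {
	reasons, err := loadReasons([]byte(sharedJSON))
	if err != nil {
		panic(err)
	}
	fmt.Println(matchReason(reasons, "550 5.1.1 User unknown")) // prints "userunknown"
}
```

The same file could be read by the Perl, Ruby, or PHP code with their own JSON decoders, so only the matching loop would be written per language.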

Maybe there is an obvious reason you didn't go this route that I can't see right now?

Anyway, thank you for your work, very interesting and useful :)

azumakuniyuki commented 2 days ago

@bohwaz Apologies for the delayed response.

I think that unifying fixed strings, starting with error message patterns, is a good idea. After initially releasing the Perl version of Sisimai, I created a Ruby version on a whim because I wanted to run it on AWS Lambda. At that time, I separated set-of-emails as a repository for test emails common to both.

Since error messages were implemented using a large number of regular expressions at the time, I was concerned that using them in a common external file would cause excessive I/O at runtime and slow things down. Therefore, I decided that hardcoding them in the repository was the most reasonable approach. I thought that since error message patterns are rarely updated, I could just copy them if needed.

Now that all error message patterns have been changed to fixed strings, I may reconsider this if it doesn't cause any performance issues, including I/O. However, my current thinking is that I strongly prefer to keep all files necessary for installation, testing (make test), and execution in a single repository.

By the way, the process of copying and pasting changes to a separate repository, while seemingly unproductive, acts as a self-contained code review and can surprisingly lead to finding improvements in the code.

Thank you for your ideas and feedback!

bohwaz commented 2 days ago

Thank you, all perfectly understandable points.

Maybe a solution would be to have a central repo of strings, and each library could generate a native source file from this repo, e.g. a hash table, which would be versioned in git. This way there would be no performance cost, since the strings would be compiled into the code, but you wouldn't have to manually keep the strings in sync between the different libraries.

This would also work for your requirement to keep all files in the same repo, as you would have a copy of the strings in each repo.
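
For the Go version, such a generator could look roughly like the sketch below. It assumes the central repo stores a JSON object mapping each reason name to a list of strings. The function name `generate`, the package name `reason`, and the variable name `ReasonStrings` are hypothetical. Nothing here reflects how go-sisimai is actually organized.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"go/format"
	"sort"
)

// generate turns shared JSON (reason -> strings) into Go source for package
// pkg, declaring a map literal so no file is read at runtime.
func generate(pkg string, data []byte) ([]byte, error) {
	reasons := map[string][]string{}
	if err := json.Unmarshal(data, &reasons); err != nil {
		return nil, err
	}

	names := make([]string, 0, len(reasons))
	for name := range reasons {
		names = append(names, name)
	}
	sort.Strings(names) // deterministic output keeps git diffs small

	var buf bytes.Buffer
	fmt.Fprintf(&buf, "// Code generated from shared data; DO NOT EDIT.\n\npackage %s\n\n", pkg)
	buf.WriteString("var ReasonStrings = map[string][]string{\n")
	for _, name := range names {
		fmt.Fprintf(&buf, "%q: {", name)
		for _, s := range reasons[name] {
			fmt.Fprintf(&buf, "%q, ", s) // %q emits a safely escaped Go string literal
		}
		buf.WriteString("},\n")
	}
	buf.WriteString("}\n")

	// format.Source applies gofmt style and fails if the output is not valid Go.
	return format.Source(buf.Bytes())
}

func main() {
	// Illustrative input only; a real run would read the file from the data repo.
	src, err := generate("reason", []byte(`{"userunknown": ["user unknown"]}`))
	if err != nil {
		panic(err)
	}
	fmt.Print(string(src))
}
```

The generated file would be committed next to the hand-written code, so installation, `make test`, and execution would still need only the one repository. The generator would only be rerun when the shared strings change.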