refugee-phrasebook / backendscripts


Rewrite/refactoring? #46

Closed sgelb closed 6 years ago

sgelb commented 9 years ago

Hello Michal,

thank you for this project! Having some time on my hands, I'm thinking about rewriting your prototype code (your words!). But before I proceed, I have some remarks and questions, and of course I need feedback:

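The major goal of my rewrite is to make this as usable as possible for end users:

  • Easy-to-use command-line program with everything set by options or interactively (see #38)
  • Implementation of a config system that lets users add other phrasebooks from Google Docs by themselves, without needing to write real code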

The biggest obstacle right now is the inconsistency of the phrasebooks on Google Docs. This makes parsing painful, error-prone and different for each phrasebook. I think this problem could largely be solved by introducing a defined row (or rows) with language codes into the spreadsheets, as you yourself mentioned in various issues (#15, #31, #33, #40 for example). This would also make it possible to dynamically skip past the comments at the beginning and to easily retrieve the languages used in each spreadsheet. Are there any plans to do this?
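
To illustrate the header-row idea, here is a minimal Python sketch, not code from either project. It assumes the spreadsheet has been exported as CSV and that the proposed row contains 2- or 3-letter lowercase language codes; the function names, the `min_codes` threshold and the sample data are all made up for illustration.

```python
import csv
import io
import re

# Assumption: a language-code cell holds a 2- or 3-letter lowercase code ("de", "ara").
CODE_RE = re.compile(r"^[a-z]{2,3}$")


def find_language_row(rows, min_codes=2):
    """Return (row index, {column index: code}) for the first row that
    contains at least `min_codes` cells looking like language codes."""
    for index, row in enumerate(rows):
        codes = {
            col: cell.strip()
            for col, cell in enumerate(row)
            if CODE_RE.match(cell.strip())
        }
        if len(codes) >= min_codes:
            return index, codes
    raise ValueError("no language-code row found")


def parse_phrasebook(csv_text):
    """Skip everything above the language-code row and return
    (list of codes, list of {code: phrase} dicts)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header_index, codes = find_language_row(rows)
    phrases = []
    for row in rows[header_index + 1:]:
        entry = {
            code: row[col].strip()
            for col, code in codes.items()
            if col < len(row) and row[col].strip()
        }
        if entry:  # ignore empty lines
            phrases.append(entry)
    return list(codes.values()), phrases


# Hypothetical export: two comment lines, then the code row, then phrases.
SAMPLE = """\
Please read the notes before editing,,
Contact: someone,,
de,en,ara
Hallo,Hello,مرحبا
Danke,Thank you,شكرا
"""

languages, phrases = parse_phrasebook(SAMPLE)
print(languages)
print(phrases[0]["en"])
```

With such a row in place, the number of comment lines above it no longer matters, and the list of available languages falls out of the same pass. A comment row that happens to contain several short lowercase words would fool this heuristic, so a real implementation would probably want an explicit marker.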

I think this project could be done in Bash, but it would be much easier in another language like Ruby, Python or Go. All of these languages are better suited to this problem, could speed up development and, as a major plus, add Windows as a platform. Ruby and Python are pretty easy to write and read, so there are a lot of potential developers, but it can be hard to install and run the code. Go, on the other hand, is less well known, so there are fewer potential developers, but you can compile static executables without any dependencies for all important platforms. End users would really benefit from this, I think, so it is my favorite.

I forked this repo and began the refactoring in the branch "refactor": https://github.com/sgelb/refugeePhrasebookCreator/tree/refactor. Right now, the fork does not work, but I think you can see the direction I'm taking so far.

I have a lot of other questions and suggestions, but for now, this is enough. Does this make any sense? Would it help? What do you think?

michal-fre commented 9 years ago

Hello sgelb,

thank you for your help and feedback.

As for the Google Docs: I hope the authors will implement it soon.

As for the language to use: I know there are many other choices. One reason to use Bash is that it has fewer dependencies (thinking of Linux and Mac): the work is done by system tools available on most of those systems. And for Windows there are runtime environments.

In any case, people need to download some files and install them. So for Windows users the dependencies are extended by a Bash environment.

Another reason I wanted to do it with bash: I know more bash than Ruby, Python, Go and so many others.

It would be great to have wrappers like the one you are writing. That will make things easier for users, and we can build some front ends on top of that.

May I invite you to this one https://github.com/refugee-phrasebook/backendscripts/issues/47

The goal would be to create a running system that exports the formats mentioned.

I know - there is a lot of cosmetic work to do :)

Thank you for your help!

Greetings

Michal

Glottotopia commented 9 years ago

Hi @sgelb, please check out https://github.com/refugee-phrasebook/py_rpb for a python implementation

On 01.10.2015 11:56, sgelb wrote:

  • Easy to use command-line program with everything set by options or interactive (see #38)
  • Implementation of a config system to enable users to add other phrasebooks from Google Doc by themselves without the need to write real code

sgelb commented 9 years ago

Hello,

thank you for your answers.

I've done some work on my version and would be happy to get some feedback. Have a look at https://github.com/sgelb/refugeePhrasebookCreator. Although there are still a lot of things to do, I would consider it usable and a solid base for further development.

Main features:

Features lacking:

I'd be happy if we could merge our efforts. How should we proceed?

michal-fre commented 9 years ago

Hello sgelb,

I like it and I would like to reuse parts of it.

Before I can do this: Licensing is MIT (https://github.com/refugee-phrasebook/refugee-phrasebook.github.io/blob/master/LICENSE.md)

Maybe it's time to explain some of my thoughts

I will create issues from this as it's just a long list we can split in single tasks.

sgelb commented 9 years ago

My thoughts. I'll number them for better referencing.

  1. I prefer to split tasks into functions, not files. This also makes better use of variables to keep data in memory, instead of having to write to and read from files between tasks.
  2. Writing the same code again and again is not good style, is very painful to maintain and makes it hard for others to get into the code. I'd prefer to check for the required program versions and output an error message instead of rewriting a lot of code for each platform. The tools used should work the same on Linux and OS X; this can be ensured by checking the versions. For example, OS X comes with feature-lacking versions of sed (see your tab problem) and grep (no support for PCRE). Installing the GNU versions solves those problems.
  3. declare -A requires Bash 4
  4. Keeping temporary data to avoid reloading data makes sense to save bandwidth. Other than that, I don't see any real benefit. On my 6-year-old computer, creating the phrasebook Basic conversation for refugees with 5 languages and over 500 columns takes ~13 seconds in total: downloading (~4s), parsing (~1s) and rendering the PDF (~8s). Even if each run took just 0.1 seconds, it would take about 2.4 days to create 2118760 versions, so speed is not really an issue.
  5. Supporting other (text-)formats like mediawiki should be easy.
  6. I thought about language detection and came to two conclusions:
    • Users need an easy way to choose the languages to include. For this, a header row with ISO-639 data is very useful, but using the already provided row of (sometimes incorrect) language names should be sufficient and saves the work of manually altering all the spreadsheets.
    • For correct rendering in TeX, we don't really care about the language; it's about the writing system/alphabet. I think this problem is solved in my script, see the comment
  7. FreeSerif from GNU FreeFonts is very promising. It renders at least Arabic, Bengali, Ge'ez, Cyrillic, Greek and Latin script. The list of supported writing systems is very long (see the right side on their page). And it's GPL3-licensed.
  8. I licensed my code under the GPLv3 and recommitted the repo, as it is not really a fork.
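
Point 6 says that rendering depends on the writing system rather than the language. As a rough illustration (a Python sketch, not the approach used in either script), the script of a phrase can be guessed from the Unicode names of its letters. `dominant_script` is a hypothetical helper, and treating the first word of a character's Unicode name as its script is a simplifying assumption.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Guess the writing system of `text`.

    Assumption: the first word of a letter's Unicode name ("ARABIC",
    "LATIN", "CYRILLIC", ...) is a good enough label for its script.
    Returns None if the text contains no letters.
    """
    counts = Counter()
    for char in text:
        if not char.isalpha():
            continue  # skip digits, punctuation, spaces, combining marks
        name = unicodedata.name(char, "")
        if name:
            counts[name.split()[0]] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


for phrase in ["Hello", "Привет", "مرحبا"]:
    print(phrase, dominant_script(phrase))
```

The result could then be used to choose a font or TeX setup per column, without knowing which language the column holds.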

To be honest, I think our projects may target the same goal, but they take quite different routes. It's hard to work together without knowing each other's preferences about project and code structure, goals and milestones, and how to proceed together. But as these are rather small projects, I do not see any problem in running them in parallel. The worst that can happen is that users have two programs at hand, and we'll still benefit from each other.