Rewrite/refactoring? - Githubissues

Hello Michal,

thank you for this project! Having some time on my hands, I'm thinking about rewriting your prototype code (your words!). But before I proceed, I have some remarks and questions, and of course I need feedback:

The main goal of my rewrite is to make this as usable as possible for end users:

- An easy-to-use command-line program with everything set by options or interactively (see #38)
- A config system that lets users add other phrasebooks from Google Docs themselves, without having to write real code

The biggest obstacle right now is the inconsistency of the phrasebooks on Google Docs. This makes parsing painful, error-prone and different for each phrasebook. I think this problem could largely be solved by introducing a defined row (or rows) with language codes into the spreadsheets, as you mentioned yourself in various issues (#15, #31, #33, #40 for example). This would also make it possible to dynamically skip past the comments at the beginning and to easily retrieve the languages used in each spreadsheet. Are there any plans to do this?
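To illustrate the header-row idea, here is a minimal Python sketch. It assumes the spreadsheet has been exported as CSV and that one row contains nothing but language codes; the code set `KNOWN_CODES`, the function name `find_language_row` and the sample data are all made up for illustration and are not taken from the real sheets or scripts.

```python
import csv
import io

# Hypothetical set of language codes; a real sheet would define its own.
KNOWN_CODES = {"de", "en", "ar", "fa", "fr", "ku", "ur"}


def find_language_row(rows, known_codes=KNOWN_CODES, min_hits=2):
    """Return (row_index, {column_index: code}) for the first row that
    holds at least `min_hits` known language codes, or None if no row does."""
    for index, row in enumerate(rows):
        codes = {
            col: cell.strip().lower()
            for col, cell in enumerate(row)
            if cell.strip().lower() in known_codes
        }
        if len(codes) >= min_hits:
            return index, codes
    return None


# Made-up sample: two comment rows, then the header row, then phrases.
SAMPLE = """\
Please do not edit this sheet without asking,,
Last update: see history,,
de,en,ar
Hallo,Hello,مرحبا
Danke,Thank you,شكرا
"""

rows = list(csv.reader(io.StringIO(SAMPLE)))
header_index, languages = find_language_row(rows)
phrases = rows[header_index + 1:]  # everything after the header row

print(header_index, languages)
print(phrases[0])
```

With such a row in place, the leading comments are skipped without any per-phrasebook logic, and the column-to-language mapping comes from the sheet itself.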

I think this project could be done in Bash, but it would be much easier in another language like Ruby, Python or Go. All of these languages are better suited to this problem, could speed up development and, as a major plus, would add Windows as a platform. Ruby and Python are pretty easy to write and read, so there are a lot of potential developers, but it can be hard for end users to install and run the code. Go, on the other hand, is less widely known, so there are fewer potential developers, but you can compile static executables without any dependencies for all important platforms. End users would really benefit from this, I think, so it is my favorite.

I forked this repo and began the refactoring in the branch "refactor": https://github.com/sgelb/refugeePhrasebookCreator/tree/refactor. Right now, the fork does not work, but I think you can see the direction I'm taking.

I have a lot of other questions and suggestions, but for now, this is enough. Does this make any sense? Would it help? What do you think?

Hello sgelb,

thank you for your help and feedback.

As for the Google Docs: I hope the authors will implement it soon.

As for the language to use: I know there are many other choices. One reason to use Bash is that it has fewer dependencies (thinking of Linux and Mac): everything is done with system tools that are available on most of these systems. And for Windows there are runtime environments.

In any case, people need to download some files and install them. So for Windows users, the dependencies are simply extended by a Bash environment.

Another reason I wanted to do it with bash: I know more bash than Ruby, Python, Go and so many others.

It will be great to have wrappers like the one you are writing. It will be easier for users, and we can build some front ends on top of it.

May I invite you to this one https://github.com/refugee-phrasebook/backendscripts/issues/47

The goal would be to create a running system that exports the formats mentioned.

I know - there is a lot of cosmetic work to do :)

Thank you for your help!

Greetings

Michal

Hi @sgelb, please check out https://github.com/refugee-phrasebook/py_rpb for a python implementation


Hello,

thank you for your answers.

I've done some work on my version and would be happy to get some feedback. Have a look at https://github.com/sgelb/refugeePhrasebookCreator. Although there are still a lot of things to do, I would consider it usable and a solid base for further development.

Main features:

- Command-line options, as requested by @andrecastro0o in #38
- Built-in configurations for the phrasebooks available on http://www.refugeephrasebook.de:
  - Basic conversation for refugees
  - Basic conversation for helpers
  - Basic conversation, short version
  - Medical phrasebook
  - Juridical phrasebook
- Configuration system to add your own phrasebooks from Google Docs
- Template file for easily changing the output
- Automatic detection of Arabic, Bengali, Cyrillic, Ge'ez, Greek and Latin writing systems. More will come.
- Fonts included, no installation necessary
- Automatic selection of landscape or portrait format

Features lacking:

- Option to show available languages and possibility to set them as an argument
- Better pdf-layout
- Options for paper size and for manually setting orientation
- More stuff, see ToDo-section in README.md

I'd be happy if we could merge our efforts. How should we proceed?

Hello sgelb,

I like it and I would like to reuse parts of it.

Before I can do this: the licensing here is MIT (https://github.com/refugee-phrasebook/refugee-phrasebook.github.io/blob/master/LICENSE.md)

Maybe it's time to explain some of my thoughts:

- Split each task into a separate script so we can call them from a MAIN wrapper (like the current runall.sh).
- Easier to understand: every script does what its name says.
- Keep versions of the same script for different systems (BSD, Linux, Mac, ...); e.g. 06_replace_tabulator_with_ampersand_MAC.sh is the Mac version. It is somewhat crazy to have the code twice, but as we are running on different systems we can merge these scripts afterwards (on a Mac, -A is unknown -> rpb2pdf.sh: line 34: declare: -A: invalid option).
- A good MAIN wrapper: read $MACHTYPE, $OSTYPE and $BASH_VERSINFO and call the right scripts according to their naming.
- Creating temp files and keeping them for later reuse will speed up the creation of different versions: the columns are already prepared as files in the temp directory, so we just do the replacement jobs (once for all columns) and join the columns. This is a must-have for batch-creating multiple versions: with n=50 languages and r=4 per phrasebook, where order is not important, there are 230300 possible language versions; with 5 languages there are 2118760 possible combinations. So speed will be an issue: just join the columns as we need them, as everything else is already prepared and ready for use.
- With these temp files prepared, we can do the replacement and joining for the MediaWiki export (https://github.com/refugee-phrasebook/backendscripts/issues/37): we reuse data and scripts, do the replacement job (this time for the MediaWiki export) and join the columns as described before.
- Language detection might be easier using header data, as in this temporary fix (https://github.com/refugee-phrasebook/backendscripts/issues/50). Language detection and splitting are partially done in 05_get_the_columns.sh.
- Language-font mapping will be implemented using something like this: https://github.com/refugee-phrasebook/backendscripts/issues/43 (so we can group e.g. Arabic-script languages under Arabic; for Farsi we use the Farsi prefixes, as we have a polyglossia mapping).
- Compatible fonts are being collected here: https://docs.google.com/spreadsheets/d/1nXNhjvuVb7CcaJBsopXa2V1xXW4Nvzdz4y93QAKZEnM/edit#gid=0
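The combination counts in the list above, and the idea of joining already-prepared columns, can be checked with a small Python sketch. It is only an illustration under assumptions: the dictionary `prepared` stands in for the per-language temp files, and the names `count_versions` and `join_columns` as well as the `" & "` separator are invented here, not taken from the existing scripts.

```python
import math
from itertools import combinations


def count_versions(n_languages, per_book):
    """Number of language selections when order does not matter."""
    return math.comb(n_languages, per_book)


def join_columns(columns, chosen, separator=" & "):
    """Join the prepared cells of the chosen languages row by row."""
    return [separator.join(cells) for cells in zip(*(columns[c] for c in chosen))]


# Stand-in for the prepared temp files: one list of cells per language.
prepared = {
    "de": ["Hallo", "Danke"],
    "en": ["Hello", "Thank you"],
    "ar": ["مرحبا", "شكرا"],
}

# Every 2-language version is built from the same prepared columns.
versions = {
    chosen: join_columns(prepared, chosen)
    for chosen in combinations(sorted(prepared), 2)
}

print(count_versions(50, 4))  # 230300
print(count_versions(50, 5))  # 2118760
print(versions[("de", "en")])
```

The point of the sketch is that the per-column work happens once, and each version only costs a row-wise join.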

I will create issues from this, as it's just a long list we can split into single tasks.

My thoughts. I'll number them for better referencing.

1. I prefer to split tasks into functions, not files. This also allows better use of variables to handle data in memory instead of having to write to and read from files between tasks.
2. Writing the same code again and again is not good style, is very painful to maintain and makes it hard for others to get into the code. I'd prefer to check for the required program versions and output an error message instead of rewriting a lot of code for each platform. The tools used should work the same on Linux and OS X; this can be ensured by checking the versions. For example, OS X comes with feature-lacking versions of sed (see your tab problem) and grep (no support for PCRE). Installing the GNU versions solves those problems.
3. declare -A requires Bash 4.
4. Keeping temporary data to avoid reloading it makes sense to save bandwidth. Other than that, I don't see any real benefit. On my 6-year-old computer, creating the phrasebook "Basic conversation for refugees" with 5 languages and over 500 columns takes ~13 seconds in total: downloading (~4s), parsing (~1s) and rendering the PDF (~8s). Even if it took just 0.1 seconds per version, it would take 2.4 days to create 2118760 versions, so speed is not really the issue.
5. Supporting other (text) formats like MediaWiki should be easy.
6. I thought about language detection and came to two conclusions:
   - Users need an easy way to choose the languages to include. For this, a header row with ISO 639 data is very useful, but using the already provided row of (sometimes incorrect) language names should be sufficient and saves the work of manually altering all the spreadsheets.
   - For correct rendering in TeX, we don't really care about the language; what matters is the writing system/alphabet. I think this problem is solved in my script, see the comment.
7. FreeSerif from GNU FreeFont is very promising. It renders at least Arabic, Bengali, Ge'ez, Cyrillic, Greek and Latin script. The list of supported writing systems is very long (see the right side of their page). And it's GPL3-licensed.
8. I licensed my code under the GPLv3 and recommitted the repo, as it is not really a fork.
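As a rough illustration of detecting the writing system rather than the language, here is a minimal Python sketch. It is not the approach used in either script: it simply looks at the Unicode character names, and the function name `detect_script` and the `SCRIPTS` tuple are made up for this example (Ge'ez characters are named ETHIOPIC in Unicode).

```python
import unicodedata
from collections import Counter

# Prefixes of Unicode character names for the scripts mentioned above.
SCRIPTS = ("ARABIC", "BENGALI", "CYRILLIC", "ETHIOPIC", "GREEK", "LATIN")


def detect_script(text):
    """Return the most frequent script among the letters of `text`,
    or None if no letter belongs to one of the known scripts."""
    counts = Counter()
    for char in text:
        if not char.isalpha():
            continue  # skip digits, punctuation, spaces, combining marks
        name = unicodedata.name(char, "")
        for script in SCRIPTS:
            if name.startswith(script):
                counts[script] += 1
                break
    return counts.most_common(1)[0][0] if counts else None


for sample in ("Hello", "Привет", "مرحبا", "γεια", "ሰላም"):
    print(sample, detect_script(sample))
```

A font or polyglossia setting could then be chosen per detected script instead of per language, which is the grouping idea discussed in this thread.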

To be honest, I think our projects may target the same goal, but they take quite different approaches. It's hard to work together without knowing each other's preferences about project and code structure, goals and milestones, and how to proceed together. But as these are rather small projects, I do not see any problem in running them in parallel. The worst that can happen is that users have two programs at hand, and we'll still benefit from each other.

refugee-phrasebook / backendscripts

Rewrite/refactoring? #46