programminghistorian / ph-submissions

The repository and website hosting the peer review process for new Programming Historian lessons
http://programminghistorian.github.io/ph-submissions

Translation of the lesson "OCR and Machine Translation" ("OCR e Tradução Automática") #371

Open joanacvp opened 3 years ago

joanacvp commented 3 years ago

Programming Historian em português has received the following proposal, from @felipelmc, to translate the lesson 'OCR and Machine Translation' ('OCR e Tradução Automática').

The lesson to be translated is at: https://github.com/programminghistorian/ph-submissions/blob/gh-pages/pt/esbocos/traducoes/ocr-e-traducao-automatica.md

This lesson is now under review and can be read at: http://programminghistorian.github.io/ph-submissions/pt/esbocos/traducoes/ocr-e-traducao-automatica

To promote rapid publication of this translated lesson, I, @joanacvp, will seek reviews from two reviewers as soon as possible.

If the translators have any concerns, they can contact the mediator for PH em português (Luís Ferla).

joanacvp commented 3 years ago

Hello @svmelton. Hopefully you can help us. @felipelmc, who translated this lesson, had some problems with it. He will explain what happened in this issue. I am also having problems loading the lesson. It always says there was a failure and I can't find out why. Thank you very much in advance for your support.

felipelmc commented 3 years ago

Hello @svmelton! As @joanacvp just said, I had some problems throughout the translation. For some reason, I could not make the translation API used in the lesson (Yandex) work. When I typed the code, I received many error messages ("Session is invalid", "Null response" or "Oops! Something went wrong and I can't translate it for you :(") as output. If I change the API to Google or Bing, as in the commands below, the translation runs normally:

trans -e google :eng file://INPUT_FILENAME > OUTPUT_FILENAME
trans -e bing :eng file://INPUT_FILENAME > OUTPUT_FILENAME

The problem is that, by changing the API, the accuracy level of the translation gets reduced, and the translation itself suffers substantial changes. Also, some problems the author tries to fix during the lesson go away, which would make it impossible to change the API without changing the content of the lesson.
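A minimal sketch of how the engine switch could be automated, trying engines in order and keeping the first non-empty result. The function name, the `TRANS_BIN` variable and the engine order are illustrative assumptions, not part of the lesson. It also assumes that a failing engine either exits non-zero or leaves the output empty, which may not hold for every error message quoted above.

```shell
# Sketch only: TRANS_BIN and translate_with_fallback are hypothetical names.
TRANS_BIN="${TRANS_BIN:-trans}"

translate_with_fallback() {
  # $1 = input file, $2 = output file, remaining arguments = engines to try
  input=$1
  output=$2
  shift 2
  for engine in "$@"; do
    # -e selects the engine; :eng is the target language used in this thread
    if "$TRANS_BIN" -e "$engine" :eng "file://$input" > "$output" 2>/dev/null \
        && [ -s "$output" ]; then
      echo "$engine"   # report which engine produced the output
      return 0
    fi
  done
  return 1             # no engine produced any output
}
```

For example, `translate_with_fallback INPUT_FILENAME OUTPUT_FILENAME yandex google bing` would print the name of the engine that worked. This does not address the difference in results between engines described above.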

Other problems that had to be dealt with during the translation had to do with my perception that some commands did not seem to be doing what they were supposed to do. In the case of the Nano file below, I could not see any difference between the input image and the output image:

#!/bin/bash
read -p "enter file name: " fl;
convert $fl -despeckle -despeckle -despeckle -despeckle -despeckle $fl
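Because the script above writes its result over the input file, the two images can only be compared if a copy was kept. A hedged sketch of one way to check the effect: write to a separate output file and make the number of passes adjustable. The function names and the file names in the usage note are hypothetical.

```shell
# Sketch only: despeckle_args and despeckle_copy are hypothetical helpers.
despeckle_args() {
  # print "-despeckle" $1 times, separated by spaces
  n=$1
  args=""
  while [ "$n" -gt 0 ]; do
    args="${args:+$args }-despeckle"
    n=$((n - 1))
  done
  printf '%s\n' "$args"
}

despeckle_copy() {
  # $1 = input image, $2 = output image, $3 = number of passes (default 5)
  # The argument list is left unquoted on purpose so it splits into words.
  convert "$1" $(despeckle_args "${3:-5}") "$2"
}
```

After `despeckle_copy page.png page_clean.png 5`, running `compare -metric AE page.png page_clean.png null:` reports the number of differing pixels, so a result of 0 would confirm that nothing changed. With ImageMagick 7 the `magick` command can stand in for `convert`.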

Finally, I could not make the following code blocks work:

mkdir $folder"_ocr"
mkdir $folder"_translation"
mv *_ocr.txt *_ocr
mv *_trans.txt *_translation

and

#!/bin/bash 
read -p "enter archive name: " $archive_name;
read -p "enter  date of visit: " $visit;

ls -lt | awk '{if ($6$7==$visit) print $9}' >> list.txt
mkdir $archive_name

for i in $(cat list.txt);
do 
  mv $i $archive_name/$archive_name${i:3}; 
done

The first block of commands should move transcribed/translated/edited files to different folders on my computer. The second one should rename the files according to the date of a visit to an archive. Unfortunately, I could not make either of them work.
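For reference, a sketch of what the first block seems intended to do, with the target folders named explicitly instead of through the `*_ocr` and `*_translation` globs. It assumes it is run from the directory that holds the `*_ocr.txt` and `*_trans.txt` files. The function name and the `prefix` argument (a stand-in for the lesson's `$folder`) are hypothetical.

```shell
# Sketch only: sort_outputs is a hypothetical helper, not the lesson's code.
sort_outputs() {
  # $1 = prefix for the two folder names (stands in for $folder)
  prefix=$1
  mkdir -p "${prefix}_ocr" "${prefix}_translation"
  for f in ./*_ocr.txt; do
    # skip the literal pattern when no file matches
    if [ -e "$f" ]; then mv "$f" "${prefix}_ocr/"; fi
  done
  for f in ./*_trans.txt; do
    if [ -e "$f" ]; then mv "$f" "${prefix}_translation/"; fi
  done
}
```

Calling `sort_outputs pdf` would create `pdf_ocr/` and `pdf_translation/` in the current directory and leave any other files where they are.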

By the way, I am working from a Macbook. I do not know if the commands did not work properly because of my computer, if I have done something wrong or anything else.

I would like to thank you in advance for your support and attention!

rivaquiroga commented 3 years ago

@joanacvp, maybe the @programminghistorian/technical-team can help with the problems with the build.

walshbr commented 3 years ago

The build should be working now. The issue was that you used double quotes inside of a caption for a couple images -

{% include figure.html filename="OCR-e-traducao-automatica-2.png" caption="Figura 2: A frase com "coruja" (owl) em russo" %}

The double quotes around coruja are a problem, because the script assumes they signal the end of the caption. I changed them to single quotes, and it should be working now - https://github.com/programminghistorian/ph-submissions/commit/35c4fd1c74f4b06d36dbfabbd885e28012ceadd4
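A rough sketch of how such captions could be spotted before a build. It assumes each figure include sits on a single line and carries only the `filename` and `caption` parameters, so a well-formed line contains exactly four double quotes. The function name is hypothetical.

```shell
# Sketch only: find_bad_captions is a hypothetical helper.
find_bad_captions() {
  # stdin: lesson Markdown; prints "line number: text" for suspect includes
  # Splitting on double quotes gives (number of quotes + 1) fields.
  awk -F'"' '/include figure.html/ && NF - 1 != 4 { print NR ": " $0 }'
}
```

It could be run as `find_bad_captions < ocr-e-traducao-automatica.md`. Includes with extra parameters would be flagged as well, so the output is only a list of lines to inspect by hand.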

DanielAlvesLABDH commented 3 years ago

Thank you @walshbr

felipelmc commented 3 years ago

Oh! That makes total sense. Thank you very much @walshbr

joanacvp commented 3 years ago

Thank you very much for your help @walshbr. I spent hours trying to understand what was wrong :). Who should we contact regarding the problems with the lesson @felipelmc stated? Thank you once again

walshbr commented 3 years ago

If there were problems with the original lesson you'd probably ping the managing editor for that language publication. So @svmelton in this case I think? Unless I'm misunderstanding.

rivaquiroga commented 3 years ago

@joanacvp, @felipelmc, you can open a ticket here to report the bug: https://github.com/programminghistorian/jekyll/issues

svmelton commented 3 years ago

Thanks @walshbr! @hawc2, would you be able to do an initial assessment of the issue described above? It may be that we need to contact the author.

hawc2 commented 3 years ago

I took a first shot at this tutorial and am running into a lot of problems, including the ones cited by @felipelmc.

I fear the main issue may be that this was composed for Linux and does not exactly translate to Unix, but the lesson introduction doesn't make this clear in any detail. Even installing Tesseract on Mac makes more sense to me using 'brew install' than the way it is currently listed in the tutorial.

I'm happy to try to debug this in more detail, but reaching out to the author for clarification may help the process.

joanacvp commented 3 years ago

@Anisa-ProgHist we also reported problems with this lesson. Should we proceed in the same way and open an issue in Jekyll? Thank you very much for all your help

anisa-hawes commented 3 years ago

Yes, please @joanacvp! The Lesson Maintenance workflow is explained step-by-step on our Wiki.

Open a new Issue within Jekyll including the following key information:

Please add the label Lesson Maintenance, and I will keep an eye out for it.

It would be useful if you could reference this Issue https://github.com/programminghistorian/ph-submissions/issues/371 within the new one that you open in Jekyll, so that I can look back through this Conversation.

@hawc2 Did you make any progress with this? I can see in your note (above) that you were intending to assess and contact the author. Let me know if you have any notes to share.

hawc2 commented 3 years ago

@Anisa-ProgHist I never heard back from the author. Working with them to deal with the troubleshooting seems the right way to go. Happy to help when we hear back

anisa-hawes commented 3 years ago

Thank you, @hawc2! That sounds sensible. I would much appreciate their + your support to solve this one (as it sounds tricky).

DanielAlvesLABDH commented 2 years ago

Hi @akhlaghiandrew, do you think you can help us understand what the problem is with our translation of your lesson? Are we missing something? Thanks in advance!

rivaquiroga commented 2 years ago

Hi, everyone! I gave this lesson a try yesterday to see if it was a good idea to translate it to Spanish.

I'm a Linux user. My impression is that the lesson was written for Mac, because the way to install tesseract in Ubuntu, for example, is slightly different, and because it uses MacPorts (i.e., the sudo port commands). I had to make a couple of changes to make the instructions work.

Regarding the problems @felipelmc found:

1. Yandex didn't work for me either, and looking around, I found an open issue on the Translate Shell repository. It's from mid-2020, and it doesn't look like it has been solved. Maybe we should add a warning advising readers to use bing or google, and to expect results that might differ a little from the ones shown in the lesson.

2. I also didn't perceive any difference after using the script that runs the despeckle command five times. I edited it, and only when using -despeckle around 30 times did I start seeing some changes:

[Screenshot from 2021-10-02 21-36-52]

3.

mkdir $folder"_ocr"
mkdir $folder"_translation"
mv *_ocr.txt *_ocr
mv *_trans.txt *_translation

To make this part work you have to run the script from your target folder in the Terminal (where the PDFs are). If you run it from the parent directory (or any other place), it won't work because the script creates the new folders in the working directory and will look for the .txt files there.

For example, in this case my folder structure was something like this:

OCR-lesson
      ¦--loop.sh
      ¦--pdf/
          ¦--120500.pdf
          ¦--119105.pdf

So I opened the Terminal in the pdf/ folder and ran the script with ../loop.sh (not ./loop.sh, as suggested) because it was in the parent folder (OCR-lesson/).

It might be a good idea to tell readers that the script will throw an error if there are files in the working directory that cannot be OCRed. It will work for image files and PDFs, but will show an error for some other formats (for example, if you have a bash script there).

4. It looks like the author does not really expect you to run the renaming script on the images from the lesson, but on other files you might have that are named according to the example he shows (something like IMG_XXXXX.png). My overall impression is that the script needs a little more explanation of how it works. For example, there is no information about the format in which you are supposed to provide the date, or how to find out which format you need (in case you are not familiar with awk). In my case it was oct2 (the three-letter abbreviation of the month in Spanish, followed by the day number); my first instinct was to try ISO 8601 format. Also, if your images' current names differ even slightly from the example (IMG_XXXXXX.png), the script won't work as expected. Currently it replaces the first three characters of the current name with the archive name you provide, so if your image name does not start with three letters, the result will not be as neat as archive_name_XXXXXXX.png. In that case you need to adapt the $archive_name${i:3} part of the script.
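A hedged sketch of the renaming step, for discussion rather than as the author's intended fix. Two things differ from the script quoted earlier in this thread: `read` would need plain variable names (`archive_name`, not `$archive_name`), and a shell variable is not expanded inside a single-quoted awk program, so the date is passed in with `-v`. The function names are hypothetical, and the field positions ($6, $7, $9) follow the original script, so they depend on the local `ls -l` format and break on file names containing spaces.

```shell
# Sketch only: files_for_date, new_name and rename_for_visit are hypothetical.
files_for_date() {
  # stdin: `ls -lt` output; $1 = month and day exactly as ls prints them
  awk -v visit="$1" '($6 $7) == visit { print $9 }'
}

new_name() {
  # $1 = archive name, $2 = current file name
  # ${2#???} drops the first three characters (POSIX form of ${i:3})
  printf '%s%s\n' "$1" "${2#???}"
}

rename_for_visit() {
  # $1 = archive name, $2 = date of visit
  mkdir -p "$1"
  ls -lt | files_for_date "$2" | while IFS= read -r f; do
    mv "$f" "$1/$(new_name "$1" "$f")"
  done
}
```

Running `ls -lt | awk '{ print $6 $7, $9 }'` first shows the date string the filter expects for each file (for instance oct2 in a Spanish locale), which could then be passed as `rename_for_visit archive_name oct2`.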


@anisa-hawes, maybe you want to discuss with @svmelton changing the difficulty of the lesson from Medium to Advanced. There are a lot of things that you are supposed to know how to do that you cannot learn just by following the two lessons that are suggested as preparation for this one. Also, the lesson expects you to infer some intermediate steps between one instruction and the next one.

BTW, Nano can be installed on Windows, not only on Mac and Linux, but the available release is very old (2017). Maybe we should point Windows users to that version, or at least explain how they can work through the section "Putting it all together with a loop" without Nano (e.g., they can use a text editor and change the file extension to .sh). It might also be a good idea to recommend ImageMagick version 7 or later. With 6.x versions you might get an error when trying to convert PDFs.
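A small sketch of a version check that readers could run before the PDF steps. It assumes the first line of the version banner looks like "Version: ImageMagick 7.1.0-4 Q16 ...". The helper names are hypothetical.

```shell
# Sketch only: im_major and check_imagemagick are hypothetical helpers.
im_major() {
  # $1 = first line of the `magick -version` or `convert -version` banner
  printf '%s\n' "$1" |
    sed -n 's/^Version: ImageMagick \([0-9][0-9]*\)\..*/\1/p'
}

check_imagemagick() {
  if command -v magick >/dev/null 2>&1; then
    banner=$(magick -version | head -n 1)
  elif command -v convert >/dev/null 2>&1; then
    banner=$(convert -version | head -n 1)
  else
    echo "ImageMagick not found" >&2
    return 1
  fi
  major=$(im_major "$banner")
  if [ "${major:-0}" -lt 7 ]; then
    echo "ImageMagick ${major:-?} found; converting PDFs may fail before version 7" >&2
    return 1
  fi
}
```

`check_imagemagick` returns 0 only when version 7 or later is found, so it could gate the loop script with `check_imagemagick || exit 1`.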

akhlaghiandrew commented 2 years ago

I'm happy to help with updating the lesson. I didn't know this was still an open issue.

DanielAlvesLABDH commented 2 years ago

@akhlaghiandrew many thanks for your availability. Also @rivaquiroga your work in reviewing this is amazing. Many thanks. @joanacvp will you follow this, please? Thank you

joanacvp commented 2 years ago

@DanielAlvesLABDH yes, I will follow this

anisa-hawes commented 1 year ago

Hello all,

Please note that as part of a reorganisation of the /pt directory, this lesson's .md file has been moved to a new location within our Submissions Repository.

It is now found here: https://github.com/programminghistorian/ph-submissions/blob/gh-pages/pt/esbocos/traducoes/ocr-e-traducao-automatica.md

A consequence is that this lesson's preview link has changed. It is now: http://programminghistorian.github.io/ph-submissions/pt/esbocos/traducoes/ocr-e-traducao-automatica

Please let me know if you encounter any difficulties or have any questions. Very best, Anisa

DanielAlvesLABDH commented 1 year ago

@joanacvp, can we move forward with this translation, or are the errors that were pointed out still getting in the way?

joanacvp commented 1 year ago

Good afternoon @felipelmc. Is the translation API working now, so that we can move forward with this lesson? Thank you very much!

DanielAlvesLABDH commented 1 year ago

Hello @joanacvp and @felipelmc. I hope all is well. Do you think we can move forward with this translation and its review soon? Thank you

DanielAlvesLABDH commented 10 months ago

Hello @joanacvp and @felipelmc, what is the status of this translation? Do you think it would be better to abandon its publication? Thank you

felipelmc commented 10 months ago

Dear all, I hope you are well! I apologize for the delay in replying to the messages here on GitHub. Unfortunately, I have little time to devote to the testing this translation requires, but I remain available for any revisions.

joanacvp commented 10 months ago

Good morning. Should we wait for the corrections to be implemented in the original version and set publication aside for now, resuming later? What would be your opinion and availability, @felipelmc?

felipelmc commented 10 months ago

I believe that is the ideal procedure, @joanacvp. I stress that I remain available to review the lesson, but unfortunately I would not be able to adapt the content.

DanielAlvesLABDH commented 10 months ago

Given that, and with thanks for all the work already done, I think it is best to wait and put this review process on hold. Thank you all

joanacvp commented 10 months ago

Hi @anisa-hawes! We decided to wait until the corrections are implemented in the original version. Should I close the issue or keep it open and label it as Sustainability + Accessibility? Thank you in advance :)