Closed tarelli closed 10 years ago
@tarelli thanks for writing this up.
@GasGen -- we think this may be a great little project for you following the data viz stuff, as you have 1) an interest in Python and 2) an interest in seeing the website translated. This project leads you through writing a script that will make it easy for the site to be translated not only into Spanish, but into any language. Are you up to the challenge? :)
This challenge is very interesting! I accept it, and I'll work on it in parallel with the Data Viz project!
Awesome, thanks @GasGen :)
@tarelli, You're Welcome! :)
What do you think about Django (the Python framework)? It supports internationalization (i18n), look at this: https://docs.djangoproject.com/en/1.0/topics/i18n/ Though I don't think it's a good idea, because that way we would have to put a lot of work into new code. I'll try what was suggested at the beginning: pyquery.
Btw I would advise generating the resource files from the website.
Input -> "it", "website_folder"
Output -> One resource file per page, with the "it" extension, automatically populated with the IDs and text coming from the website (English version)
The algorithm would go through the HTML files in the website and, for every DOM node where an ID is specified and the HTML content doesn't start with "<" (nested elements don't have text), create a key-value pair. This way it would be a matter of translating the file without having to look up the IDs manually. Just a suggestion.
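A minimal sketch of the extraction step suggested above, assuming Python 3 and well-formed HTML. It uses the standard library's html.parser instead of pyquery only to stay self-contained; the names `IdTextCollector` and `extract_resources` are hypothetical, and "has no nested elements" is one possible reading of the rule that the content must not start with "<".

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class IdTextCollector(HTMLParser):
    """Collect {id: text} for elements that have an id and no child elements."""

    def __init__(self):
        super().__init__()
        self.stack = []       # one [id, text_parts, has_child] entry per open element
        self.resources = {}

    def handle_starttag(self, tag, attrs):
        if self.stack:
            self.stack[-1][2] = True   # the enclosing element has a nested element
        if tag not in VOID_TAGS:
            self.stack.append([dict(attrs).get("id"), [], False])

    def handle_data(self, data):
        if self.stack:
            self.stack[-1][1].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        elem_id, parts, has_child = self.stack.pop()
        text = " ".join("".join(parts).split())   # collapse whitespace
        if elem_id and not has_child and text:
            self.resources[elem_id] = text


def extract_resources(html):
    """Return the id -> English text pairs found in one HTML page."""
    collector = IdTextCollector()
    collector.feed(html)
    return collector.resources
```

Writing each pair as an `id = text` line would then give a translator a pre-populated file to work from.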
Hi tarelli, and sorry for the delay; I've been polishing some details of the Data Viz work with Stephen's help. I recently looked at the PyQuery library, and I have something that gets the IDs from the website. It is only the basis: it still lacks a 'for loop' to obtain all the IDs, so for now it gets one ID of each tag type. (For the missing loop, I'm thinking of bounding it, namely finding the range for the loop length.)
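from pyquery import PyQuery as pq
import os

os.system('clear')
d = pq(url="http://www.openworm.org")  # load the page straight from the URL
i = 0
# Iterating over d visits only the top-level element(s),
# so this prints one tag of each type.
for p in d:
    print d('p').eq(i).attr('id'), '='
    print d('a').eq(i).attr('id'), '='
    print d('div').eq(i).attr('class'), '='
    i += 1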
To see whether I can obtain all the IDs in the HTML, I tested with an example range in the 'for loop':
# Use the loop variable itself as the index (example range of 20 elements).
for i in range(0, 20):
    print d('p').eq(i).attr('id'), '='
    print d('a').eq(i).attr('id'), '='
    print d('div').eq(i).attr('class'), '='
And yes, it works, but I need to know the exact range. I also have to ignore the tags without an ID or class, because the script prints those empty tags as "None".
The next steps are to limit the range, ignore the empty tags, write the IDs to a file and replace them in the website HTML.
As you can see, I'm having a problem opening an HTML file, so for now I use the project URL directly to test the script.
Hi, sorry for the double post. Well, PyQuery does not have much documentation, but while searching I found another library, BeautifulSoup (4 is the current version), which makes it very easy to extract strings from a web page! Documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
Now I'm doing a first test extracting strings, for example the title of the page:
from bs4 import BeautifulSoup #Import BeautifulSoup.
html = open('OpenWorm.html', 'r') #Read the html file.
soup = BeautifulSoup(html) #Now convert the html into a BeautifulSoup object.
print soup.title #In this case, I extract the title tag together with the string inside it.
The result of this is:
<title>OpenWorm</title>
But I can also get only the string between the tags, by adding ".string" to the print line:
print soup.title.string #In this case, I extract only the string between the tags.
result:
OpenWorm
Another example is extracting all the 'a' tags (or any other tag) of the HTML, together with their content:
print soup.find_all('a')
Ladies and gentlemen (in Spanish: Señoras y Señores!), the script WORKS!
BeautifulSoup really is an excellent library. I wrote a script that lets you translate the site from the terminal (console). How? Well, first I'll show the script:
#OpenWorm Translate the website! - Localisation script
#author: Gaston Gentile
from bs4 import BeautifulSoup
import os

os.system('clear')
print 'OpenWorm site translator'
print 'Let\'s go, translate the site!\n'
html = open('OpenWorm.html', 'r') #Open the original html of the website ('rw' is not a valid mode).
soup = BeautifulSoup(html) #Convert the html into soup.
language_name = raw_input('First write the language name of the new translation: ') #For example "Spanish"

#Create a folder with the name of the language.
if not os.path.isdir(language_name):
    os.mkdir(language_name)
    print ('The folder %s has been created.' % language_name)
else:
    print ('The directory %s already exists' % language_name)
    exit()

#At the moment the script only translates "a" tags (for example links) or "p" tags.
translate_select = raw_input('Select the tag you want to translate: \n1-a\n2-p\noption: ')
traduccion = open('spanish-openworm.html', 'w')
if translate_select == '1':
    label_a = soup.find_all('a')
    for n in label_a:
        print n.string
        change_string = raw_input("New string: ")
        n.string = change_string
        print (n)
        print ('\n')
    traduccion.write(str(soup))
elif translate_select == '2':
    label_p = soup.find_all('p')
    for n in label_p:
        print n.string
        change_string = raw_input("New string: ")
        n.string = change_string
        print (n)
        print ('\n')
    traduccion.write(str(soup))
else:
    print 'Error, incorrect option.'
    exit()
It is very easy to understand. I call it "live translation". Why? Because when you launch the script, it first asks you for the name of the language and creates a folder with that name. Then it asks what you want to translate, meaning which HTML tags (a tags or p tags). The terminal then shows the string of each tag and a prompt for the new string (the translation) :D. After all the tags are done, the script automatically saves the newly translated website!
Obviously, the script needs some changes, for example: selecting the HTML file to translate, choosing the name of the translated HTML file, and asking whether to continue editing other tags or files.
But this is the basis and it works very nicely. Here is a result (only the links are translated in this case):
@GasGen congratulations on the great progress!
A suggestion: how about translating based on tag IDs and not based on tags? Basically you could just look for element ID and replace the html inside the element with the translation text, no matter what the html tag is. This way the translation file for each language is just a list of key-value pairs (ID, translation).
Keep up the good work! :)
@JohnIdol, thanks!
Great suggestion. Matteo told me the same thing earlier (in the previous comments) about looking up the string by tag ID, and it is a great idea, but the problem is that not all tags have an ID.
An example:
<a href="./getting_started.html">Get Started</a>
As you can see, this tag doesn't have an ID, only the string inside it. And this is repeated in various tags (no ID).
Thanks again! :)
That's no problem - we can add IDs to everything that needs to be translated :D
@GasGen Very nice! The interactive translation mode is interesting, let me suggest a little tweak. The script could create a file with the translated resources (same as in my original proposal) as you input them. When you run the script the script automatically asks you to translate all the html elements that have an ID and that are not yet in the resource file. The script uses the file to generate the HTML page.
To recap the steps of the script:
1) Check if a resource file for the given language exists; if not, create it.
2) Load the resource file in memory as key-value pairs, where the key is the id and the value is the translated string (the map could be empty if we just created the file).
3) Start processing the HTML pages. For every ID found in the page that doesn't exist yet in the resource file (which is now loaded in memory), ask the user to input a translation, as you are already doing, and add the user input to the resource map in memory (created at step 2).
4) Save the map back to the resource file.
5) Create the translated HTML pages, replacing the text based on id and fetching the translated string from what you have in memory.
This will allow us to reuse the script and the resource file after we update the website, and the script will automatically prompt the user to add the missing strings. It also gives us the freedom to translate a resource file directly, without necessarily going through the interactive mode, and use the script just to generate the translated HTML files. Of course feel free to add an ID to all the elements that don't have one!
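Steps 1 to 4 of the recap could look roughly like the sketch below, assuming Python 3 and a resource file with one `id = translation` line per entry. The function names and the `ask` callback are hypothetical; step 5 (writing the translated HTML) is left out.

```python
import os


def load_resources(path):
    """Steps 1-2: return the id -> translation map, empty if the file is missing."""
    resources = {}
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if "=" in line:
                    key, value = line.split("=", 1)
                    resources[key.strip()] = value.strip()
    return resources


def fill_missing(resources, page_ids, ask):
    """Step 3: ask only for the ids that have no translation yet.

    page_ids maps each id found in the page to its English text;
    ask(elem_id, english) returns the translation typed by the user.
    """
    for elem_id, english in page_ids.items():
        if not resources.get(elem_id):
            resources[elem_id] = ask(elem_id, english)
    return resources


def save_resources(path, resources):
    """Step 4: write the map back, one 'id = translation' line per entry."""
    with open(path, "w", encoding="utf-8") as handle:
        for key, value in resources.items():
            handle.write("%s = %s\n" % (key, value))
```

In interactive mode `ask` could be something like `lambda elem_id, english: input("%s (%s): " % (elem_id, english))`. Because already translated ids are skipped, re-running after a website update only prompts for the new strings.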
@JohnIdol @tarelli, thanks! Great, Matteo! Now I'll get to work on updating the script, and I will add an ID to the elements :D so that it works better.
I've made some progress:
# -*- coding: utf-8 -*-
#OpenWorm Translate the website! - Localisation script
#author: Gaston Gentile
from bs4 import BeautifulSoup
import os

os.system('clear')
print 'OpenWorm site translator'
print 'Let\'s go, translate the site!\n'
option = input('What do you want to do? \n1-Create resource file\n2-Translate the site\noption: ')

if option == 1:
    html = open('OpenWorm.html', 'r') #Open the original html of the website ('rw' is not a valid mode).
    soup = BeautifulSoup(html) #Convert the html into soup.
    language_name = raw_input('\nFirst write the language name of the new translation: ') #For example "Spanish"
    extension = language_name[:2]
    #Create a folder with the name of the language and a file with the language extension.
    if not os.path.isdir(language_name):
        os.mkdir(language_name)
        resource_file = open("%s/openWorm.%s" % (language_name, extension), 'w+')
        print ('The folder %s has been created.' % language_name)
    else:
        print ('The directory %s already exists' % language_name)
        exit()
    #I use these tags because they have strings between them.
    tags_id = soup.find_all(['a', 'p', 'h1', 'h2'], id=True)
    for n in tags_id:
        tag = ("%s = \n" % n['id']) #n.string could be added as a reference for the translation.
        resource_file.writelines(tag)
        print tag
    print ('The resource file openWorm.%s has been created.' % extension)
    html.close()
    resource_file.close()

if option == 2:
    #Select the language of the translation.
    language_translation = raw_input('Generate translation of: ')
    #Open the html to translate.
    html = open('%s/OpenWorm.html' % language_translation, 'r+')
    #Convert the html into soup.
    soup = BeautifulSoup(html)
    #Open the resource file (with the translated words).
    resource = open('%s/openWorm.%s' % (language_translation, language_translation[:2]))
    values = {}
    #Build the dictionary from the resource file; each line has the form "id = translation",
    #matching what option 1 writes (the original split on ':').
    for line in resource:
        (key, val) = line.split('=', 1)
        values[key.strip()] = val.strip()
    resource.close()
    print values
    #Replace the strings.
    for key, value in values.iteritems():
        content = soup.find(id=key)
        content.string.replace_with(value)
        print content
    #Overwrite the file instead of appending after the content already read.
    html.seek(0)
    html.truncate()
    html.write(str(soup))
    html.close()
This code only takes and translates the strings of one HTML file (I later modified it to take the IDs of all the HTML files). The idea of the script works, but I have a problem with tags that have children, for example:
<p id="example">
This is an example of parent tag with childrens.
<a href="page_url" id="link_page">Visit the page</p>
</p>
As you can see, in this case there is a parent tag ('p') and then a child tag ('a'). In these cases the script does not work because, when it changes the string of 'p', it runs into the 'a' tag and crashes. I'm now thinking of a solution to fix the problem, and then I'll upload the new code ;).
@GasGen nice progress! :)
About this:
<a href="page_url" id="link_page">Visit the page</p>
The above is malformed html (mis-matching opening and closing tags) and it should never happen. If we have something like that on our pages we should fix it.
This is just a suggestion but the easiest way to do this would be in my opinion:
But feel free to disregard if you have a better idea! :)
@JohnIdol, about the malformed HTML: sorry, my mistake, I wrote it without paying much attention. The correct form is:
<a href="page_url" id="link_page">Visit the page</a>
There is no problem like that in the HTML pages.
Exactly, I think we mean the same thing. The script runs and, for each line in the resource file, gets the ID and the translation and writes them into a dictionary. Then the script looks at the HTML and, for each given ID, replaces the HTML with the translated string. This part of the code works fine.
But the script has a problem when it meets a parent tag with a child tag, as in the previous example:
<p id="parent_tag">
This is an example of parent tag with children.
<a href="page_url" id="children_tag">Visit the page</a>
Here the paragraph continues.
</p>
If the tag doesn't have children, the translation works, but when the script runs and meets a parent tag with a child tag, it fails with:
AttributeError: 'NoneType' object has no attribute 'replace_with'
I realized that the children cause the problem, because if I delete the child tag, for example:
<p id="parent_tag">
This is an example of parent tag without children.
Here the paragraph continues.
</p>
The script works.
Edit: Using the "contents" attribute of Beautiful Soup, the problem is avoided and the translation works, but not correctly. Using the previous example, it first translates parent_tag and then children_tag, in that order, so the result is:
This is an example of parent tag with children.
Here the paragraph continues.
Visit the page
when really "Visit the page" should be on the middle line of the example. But I think I have the solution; I'll upload it in the next post if it works.
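For reference, the AttributeError above can be reproduced with a short sketch, assuming Python 3 and Beautiful Soup 4. In bs4, `.string` is None whenever a tag has more than one child, which is why `content.string.replace_with(...)` fails on the parent paragraph. The helper `translate_in_place` is hypothetical; it replaces each text fragment where it stands, so the child link keeps its position.

```python
from bs4 import BeautifulSoup, NavigableString

html = '''<p id="parent_tag">
This is an example of parent tag with children.
<a href="page_url" id="children_tag">Visit the page</a>
Here the paragraph continues.
</p>'''

soup = BeautifulSoup(html, "html.parser")
parent = soup.find(id="parent_tag")
child = soup.find(id="children_tag")

# The parent has three children (text, <a>, text), so .string is None.
print(parent.string)
# The link has exactly one text child, so .string is that text.
print(child.string)


def translate_in_place(tag, translate):
    """Replace every non-empty direct text fragment of tag, leaving child tags alone."""
    for node in list(tag.contents):
        if isinstance(node, NavigableString) and node.strip():
            node.replace_with(translate(str(node)))
```

Calling `translate_in_place(parent, some_function)` changes the two text fragments and leaves the `<a>` element between them, which avoids the reordering described above.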
Well, I'm very close to finishing the script. I have some problems, but I'm trying to fix them. The problem I ran into is the one described in the previous comment: if the script meets a (parent) tag with children (for example hyperlinks), the translation is wrong. For example, if the script searches for the tag "text_openWorm" and finds this:
<p id="text_openWorm">
This text is from <a href="openworm.org" id="link_openworm">OpenWorm web site</a>,
and is very cool!
</p>
then we have the problem. Why? Because when the script translates the string of the id "text_openWorm", it runs into the other tag with an id (link_openworm) in the middle, so a translation only covers the first fragment of the paragraph ("This text is from...") and not the second part. To fix it, I read more about BeautifulSoup and found the "contents" attribute, which splits the content of the tag into fragments. The previous example, split into fragments with "contents", looks like:
First content: This text is from
Second content: <a href="openworm.org" id="link_openworm">OpenWorm web site</a>
Third content: , and is very cool!
In this case the words or sentences to translate are not taken from the resource file; they are translated from the console (for the moment).
To achieve this, I wrote a conditional that evaluates "if contents > 0": if the tag is split into two or more parts, it gives me the first fragment (translated manually), then the second (translated manually), and so on, and then the translation works. Otherwise, the tag has only one content and is translated automatically from the resource file. I have to polish the conditional and the while loop to make them work better, but I have the basis and I uploaded it to Gist: https://gist.github.com/GasGen/5680218
Great progress @GasGen! I think you will have to make the code recursive to take into account whether the nested node has other nested nodes, and so on.
Thanks @tarelli!, great idea! I keep working.
Maybe it's a little late, but I would like to suggest another approach. Take this for example:
<p id="text_openWorm">
This text is from <a href="openworm.org" id="link_openworm">OpenWorm web site</a>,
and is very cool!
</p>
Instead of looking for IDs, why not replace the text with placeholders, this way:
<p id="text_openWorm">
{{ home['verycool'] }}
</p>
Your language file, e.g. spanish.tr, would have something like this:
home['verycool'] = "This text is from <a href="openworm.org" id="link_openworm">OpenWorm web site</a>, and is very cool!"
Then it would be just a simple case of search and replace. If the key was not found in the file, the script would look for the info in english.tr. This way anyone could take english.tr and translate it into any other language. You could have one entry per page, or reuse the text on different pages.
Just a rough idea.
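A minimal sketch of this search-and-replace idea with the English fallback, assuming Python 3 and that the language files have already been loaded into dictionaries. The function name `render` and the in-memory dictionaries are hypothetical; the `.tr` file format itself is not specified here.

```python
import re

# Matches {{ key }} where key may look like home['verycool'] or citizenHeader.
PLACEHOLDER = re.compile(r"\{\{\s*([\w\[\]']+)\s*\}\}")


def render(template, translations, fallback):
    """Replace every {{ key }} with its translation, else with the English text."""
    def lookup(match):
        key = match.group(1)
        if key in translations:
            return translations[key]
        # Unknown keys are left untouched so they are easy to spot in the output.
        return fallback.get(key, match.group(0))
    return PLACEHOLDER.sub(lookup, template)


english = {"home['title']": "OpenWorm", "home['verycool']": "This is very cool!"}
spanish = {"home['verycool']": "Esto es muy bueno!"}
template = "<h1>{{ home['title'] }}</h1><p id=\"text_openWorm\">{{ home['verycool'] }}</p>"
print(render(template, spanish, english))
```

Here the title falls back to the English entry because the Spanish dictionary does not define it.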
@msasinski, it's never too late. Your idea is very cool! Also, it is not necessary to put in:
{{ home['verycool'] }}
if we only leave the tags (with the id):
<p id="text_openWorm">
</p>
The script searches in the resource file:
text_openWorm = "This text is from <a href="openworm.org" id="link_openworm">OpenWorm web site</a>, and is very cool!"
And then just replace.
@GasGen @msasinski if I understand the suggestion, I don't like it too much, to be honest. A resource file is always just about resources; it's something you can give to a translator, for instance, and it should never have HTML code in it.
@JohnIdol I don't think it's over-engineering; there's no engineering at all, it's a script with a recursive function.
What you point out is a common problem in resource files, for instance when you have to translate the content of a dialog that displays a message containing some data coming from the program. For "Error you have already used 5 invites today", the resource file would have "Error you have already used {1} invites today".
Similarly we'd have something like myid = Some text {mylinkid1} some other text {mylinkid2}
What if you have this
<div id="div1"> Nice text <div id="nested"> ..... 50 lines of HTML </div></div>
? Good luck with the resource file in that case.
@tarelli if that's the direction we are going sounds like a clean solution and also simple enough
@tarelli I understand your concern, but my solution, although not perfect, will prevent us from running into a few potential issues like
Some html in resource files is unavoidable anyway. Most web frameworks work this way. Adding IDs to everything that needs to be translated will be problematic. What if you have this situation:
<p>Lorem ipsum dolor sit amet, <a href="link">consectetur</a> adipiscing elit. Donec <a href="link2">dapibus dignissim</a> nisi id placerat. Duis lorem dui, molestie id vestibulum quis, commodo <a href="link">eu</a> lacus.</p>
Here you have two links (link,link2) and you would need to create four different ids just for this part of the code. p, 2xlink,link2 as you can't have one id used for two elements. Even if you create four different ids just for this small part of the code it will make it look really bad, increase the page size, and potentially complicate css.
@JohnIdol You wrote:
<div id="myid">
Some text <a href="#" id="mylinkid1">link here</a> some other text <a href="#" id="mylinkid2">another link</a> keeps going <a href="#" id="mylinkid3">yet another link</a> all sorts of stuff.
</div>
If you have to define translations for those bits of text between the various A tags it just gets too complicated for what we need in my opinion. I'd rather have something slightly less clean but not over-engineered.
My solution would create just one place where you would have this:
home['intro'] ="Some text <a href="#" >link here</a> some other text <a href="#">another link</a> keeps going <a href="#">yet another link</a> all sorts of stuff."
You would not need to add 50 or so ids to the page just to translate this page.
And there is another situation no one has taken into consideration: there will be cases where the translated text in one language differs from the original and requires different link/HTML element placement. Example: what if you want to put some specific word in bold and italics?
English
With only a <strong>thousand cells</strong>, it solves <em>basic problems</em> such as feeding, mate-finding and predator avoidance.
Polish
Jest w stanie rozwiazac <em>podstawowe problemy</em> jak karmienie, poszukiwanie partnera i unikanie zagrozen mimo iz zbudowany jest tylko z <strong>tysiaca komorek</strong>
two problems are visible in this simple situation:
I understand that this will require a lot of work upfront, but we need to do this right the first time.
@msasinski The current host is not GitHub but AppEngine. This script was designed with what we have today in mind, i.e. a static website. If the website ever has a server-side component, translation will most likely happen server side, and there are different ways to do this; we'll still be able to reuse the resource files we have today with a few adjustments.
I don't think adding a unique id to every element that has some text in it is a problem. Also to answer your questions as far as I know you can put an id (any attribute in fact) on any tag, strong included.
Your example would be:
English
id1=With only a {id2}, it solves {id3} such as feeding, mate-finding and predator avoidance.
id2=thousand cells
id3=basic problems
Polish
id1=Jest w stanie rozwiazac {id3} jak karmienie, poszukiwanie partnera i unikanie zagrozen mimo iz zbudowany jest tylko z {id2}
id2=tysiaca komorek
id3=podstawowe problemy
This will not cause any problem in CSS.
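The substitution in this example can be sketched with a small recursive function, assuming Python 3. The `wrappers` dictionary (id to opening and closing tag) is a hypothetical stand-in for what the script would read from the original page; the resource entries are the ones listed above.

```python
import re


def expand(elem_id, resources, wrappers, _seen=()):
    """Return the text for elem_id with every {child_id} replaced by the child element."""
    if elem_id in _seen:
        raise ValueError("circular reference: %s" % elem_id)
    trail = _seen + (elem_id,)

    def replace(match):
        child = match.group(1)
        open_tag, close_tag = wrappers.get(child, ("", ""))
        # Recurse so that a nested element can itself contain placeholders.
        return open_tag + expand(child, resources, wrappers, trail) + close_tag

    return re.sub(r"\{(\w+)\}", replace, resources[elem_id])


wrappers = {"id2": ('<strong id="id2">', "</strong>"), "id3": ('<em id="id3">', "</em>")}
english = {
    "id1": "With only a {id2}, it solves {id3} such as feeding, mate-finding and predator avoidance.",
    "id2": "thousand cells",
    "id3": "basic problems",
}
polish = {
    "id1": "Jest w stanie rozwiazac {id3} jak karmienie, poszukiwanie partnera i unikanie zagrozen mimo iz zbudowany jest tylko z {id2}",
    "id2": "tysiaca komorek",
    "id3": "podstawowe problemy",
}
print(expand("id1", english, wrappers))
print(expand("id1", polish, wrappers))
```

The same wrappers serve both languages: in the Polish output the `<em>` element comes before the `<strong>` element simply because the placeholders appear in that order in the resource string.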
While you're right that any element can have an ID, it's still discouraged to use them unnecessarily. Most of the formatting should be done in CSS, not HTML.
It's better, and not only in my opinion, to have less code in the HTML and most of it in CSS. There are a few reasons for this, mainly page size (faster rendering, faster downloads), cleaner code, and better SEO.
We are not talking about formatting here, that is already all in CSS and adding ids to the elements that have text would not change that.
@msasinski you are right that it is discouraged to use IDs for applying CSS when unnecessary - but this has nothing to do with CSS or formatting, we are talking about using IDs to identify an element (for whatever reason, in our case translation) and that's their legitimate function. There is no extra CSS involved if we add IDs. The formatting (if any) will still be done whatever way is being done now.
@tarelli, @JohnIdol While I understand that there will be no code added to css it still will make the page slightly bigger in size, and decrease text/code ratio which in turn will negatively impact SEO. It also requires additional work mainly adding ids to the text.
While I understand that this battle is mostly lost, here is my last salvo :)
Below is comparison of both solutions. html used in this example is copy of what we have currently on the http://www.openworm.org/get_involved.html page. This is just one paragraph "Curious citizen" not a whole page as that would take too much time and get too complex with the current solution
========currently proposed solution ===============
IDs needed: citizenHeader,citizenParagraph,citizenGettingStartedAnchor,citizenStrong1,citizenStrong2, citizenStrong3,citizenContactUsAnchor
Resource file:
citizenHeader= "Curious citizen"
citizenGettingStartedAnchor = "goal"
citizenStrong1 = "raise awareness"
citizenStrong2 = "spread the word"
citizenStrong3 = "love"
citizenContactUsAnchor = "contact us"
citizenParagraph = "One thing we need more than anything is for people to know about us. Getting visibility for the project will help us attract attention, <strong id="citizenStrong1">raise awareness</strong> and ultimately reach our <a href="./getting_started.html#goal" id="citizenGettingStartedAnchor">goal</a> faster.
<br/><br/>
If you would like to help us you can <strong id="citizenStrong2">spread the word</strong> with your family and friends and explain them how a biological accurate simulation of a tiny worm could help a lot to accelerate the cure of diseases.You can also tell your geek friends how it would be totally cool to have a virtual worm living inside your computer.Also if there is anything you think you could help us with which doesn't fall in any of the above categories please <a href="./contacts.html" id="citizenContactUsAnchor">contact us</a>, we would <strong id="citizenStrong3">love</strong> to hear from you!"
Resulting html:
<section id="citizen">
<div class="page-header">
<h1 id="citizenHeader">Curious citizen</h1>
</div>
<div class="span9 pagination-centered">
<i class=" icon-beer icon-xl"></i>
</div>
<p class="lead" id="citizenParagraph">
One thing we need more than anything is for people to know about us. Getting visibility for the project will help us attract attention, <strong id="citizenStrong1">raise awareness</strong> and ultimately reach our <a href="./getting_started.html#goal" id="citizenGettingStartedAnchor">goal</a> faster.
<br/><br/>
If you would like to help us you can <strong id="citizenStrong2">spread the word</strong> with your family and friends and explain them how a biological accurate simulation of a tiny worm could help a lot to accelerate the cure of diseases.You can also tell your geek friends how it would be totally cool to have a virtual worm living inside your computer.Also if there is anything you think you could help us with which doesn't fall in any of the above categories please <a href="./contacts.html" id="citizenContactUsAnchor">contact us</a>, we would <strong id="citizenStrong3">love</strong> to hear from you!
</p>
</section>
========== new solution ==========
resource file
citizenHeader = "Curious citizen"
citizenParagraph = "One thing we need more than anything is for people to know about us. Getting visibility for the project will help us attract attention, <strong>raise awareness</strong> and ultimately reach our <a href="./getting_started.html#goal">goal</a> faster.
<br/><br/>
If you would like to help us you can <strong>spread the word</strong> with your family and friends and explain them how a biological accurate simulation of a tiny worm could help a lot to accelerate the cure of diseases.You can also tell your geek friends how it would be totally cool to have a virtual worm living inside your computer.Also if there is anything you think you could help us with which doesn't fall in any of the above categories please <a href="./contacts.html">contact us</a>, we would <strong>love</strong> to hear from you!"
template file
<section id="citizen">
<div class="page-header">
<h1 id="citizenHeader">{{citizenHeader}}</h1>
</div>
<div class="span9 pagination-centered">
<i class=" icon-beer icon-xl"></i>
</div>
<p class="lead" id="citizenParagraph">{{ citizenParagraph }}</p></section>
resulting html - just as it is now, no new ids needed
Now multiply this to let's say 5 different languages. I rest my case.
@msasinski it is true that it requires work to place the ids, but I am honestly not so concerned about file-size and SEO (unclear how it would affect search engine optimization).
Either way, even though I side with @tarelli, you fought the good fight and it comes down to @GasGen to pick his preferred solution since he's the one who's driving this (that's the golden rule around here for things like this), and I am sure he's going to be grateful (as am I) for the clarifying examples you provided one way or the other :)
@tarelli, @JohnIdol, @msasinski great discussion. Well, I think everyone has good ideas. I must say that I had the idea of putting the ID between a pair of curly braces because, as @msasinski says, one of the problems is that the structure of different languages is not the same. For example:
English:
<p id="mitLicense">
All the code is under <a href="url" id="MIT">MIT</a> license.
</p>
Spanish:
<p id="mitLicense">
Todo el código está bajo la licencia <a href="url" id="MIT">MIT</a>.
</p>
As you can see, only the position of one simple string changes, and it is already a problem.
What are the problems we have?
The first two points are related, because the position of the tags nested within other tags depends completely on the language being translated into. Is the third point really a problem? I think it is more confusing than problematic, isn't it? We have two solutions, both acceptable:
1- @tarelli proposed adding an ID to all the tags, including strong and other formatting tags. In the resource file we then write the string and, wherever a link or a formatting tag appears in the middle, replace it with a pair of curly braces containing the ID of the formatted or linked string.
2- @msasinski proposed leaving only the parent tag with an ID and then, in the resource file, creating the new (translated) content of the parent, including all the HTML tags it has.
I think the two proposals are good. Moreover, analyzing the code, doesn't the script do the same thing in both cases? If I understand correctly, the content of the parent tag is replaced with the new content (the translated text). In the first case the text of the parent tag has curly-brace placeholders with the ID names in different positions, and the script has to take each name and replace it. In the second case, the text of the parent tag is added including the translated formatting tags, link tags, etc. But the problem there is the repetition.
I agree with @tarelli's proposal.
@msasinski @GasGen in my proposal the resource file is not like that, in my proposal the resource file is:
citizenHeader= Curious citizen
citizenGettingStartedAnchor = goal
citizenStrong1 = raise awareness
citizenStrong2 = spread the word
citizenStrong3 = love
citizenContactUsAnchor = contact us
citizenParagraph = One thing we need more than anything is for people to know about us. Getting visibility for the project will help us attract attention, {citizenStrong1} and ultimately reach our {citizenGettingStartedAnchor} faster.
If you would like to help us you can {citizenStrong2} with your family and friends and explain them how a biological accurate simulation of a tiny worm could help a lot to accelerate the cure of diseases.You can also tell your geek friends how it would be totally cool to have a virtual worm living inside your computer.Also if there is anything you think you could help us with which doesn't fall in any of the above categories please {citizenContactUsAnchor}, we would {citizenStrong3} to hear from you!
citizenStrong1=raise awareness
citizenGettingStartedAnchor=goal
citizenContactUsAnchor= contact us
citizenStrong2=spread the word
citizenStrong3=love
Note there's no HTML and the {elements} can move around to support different constructs in different languages.
@tarelli, yes, I understand your proposal (the resource file is like the first one I used, but with curly-brace placeholders in it). I think your proposal is better because repetition is avoided and the resource file is cleaner. I'll bring progress soon.
I have an idea for the code (based on @tarelli's proposal). If you remember, in the previous script I wrote some lines to find tags with child tags, but the problem was that the script treated the "br" tags as children. I fixed that by modifying some lines, and this part of the script is better now. The previous for loop is deleted and replaced with a simple if statement. https://gist.github.com/GasGen/5680218 Now, my next step is the interaction with the resource file.
Create a Python script to generate alternative versions of the website in other languages. The script will go through a resource file for every page of the website and replace the string contained in the value of the DOM elements according to the ID specified in the resource file.
Example:
index.it
index.html
The script file will replace "We are building a digital worm. For real." with "Stiamo costruendo un verme digitale. Per davvero." and so on. The script will write the translated version in a folder named like the extension of the resource file, "it" in this case. There will be one resource file per page. If the website has 8 pages and we support 5 languages we will have 40 resource files.
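The naming convention described above (resource file `index.it`, output in a folder named `it`) can be sketched as follows, assuming Python 3. The helper names `output_path` and `resource_files` and the `page1.html` style page names are hypothetical, used only to illustrate the mapping and the file count.

```python
from pathlib import Path


def output_path(resource_file, site_root="."):
    """Map a resource file such as index.it to <site_root>/it/index.html."""
    resource = Path(resource_file)
    language = resource.suffix.lstrip(".")      # "index.it" -> "it"
    return Path(site_root) / language / (resource.stem + ".html")


def resource_files(pages, languages):
    """List one resource file name per page and per language."""
    return ["%s.%s" % (Path(page).stem, lang) for lang in languages for page in pages]


print(output_path("index.it"))
pages = ["page%d.html" % n for n in range(1, 9)]   # 8 placeholder page names
print(len(resource_files(pages, ["it", "es", "fr", "de", "pl"])))
```

With 8 pages and 5 languages the second call lists the 40 resource files mentioned above.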
The rationale behind this script is to not have redundant code and still have the website indexed in all the languages.
In order to select the different DOM elements, pyquery looks at first glance like a good solution, since it allows jQuery-style selectors inside a Python script.