juanrloaiza / latinamerican-philosophy-mining

Text mining philosophy journals in Latin America.

Latin American Philosophy Mining

Authors: Juan R. Loaiza (URosario) and Miguel González Duque (ITU Copenhagen)

In this repository we track progress on a research project in which we apply text mining to philosophy journals in Latin America. Our aim is to provide insights into the history of philosophy in Latin America using a data-driven approach.

We started with articles published in Ideas y Valores (Colombia) between 2009 and 2017. We are now expanding the Ideas y Valores corpus to cover all articles since the journal's foundation in 1951. We plan to extend it later to further years and to other journals such as Crítica (Mexico) and Análisis Filosófico (Argentina).


TODO: This structure is now outdated.

├── data                # Data files (omitted from Git repository for the moment).
│   ├── corpus          # Parsed JSON files after preprocessing.
│   ├── rawHTML         # Raw HTML files directly as scraped with metadata.
│   ├── rawPDF          # Raw PDF files directly as scraped with metadata.
│   ├── parsedHTML      # Parsed HTML using Article class (see utils).
│   └── parsedPDF       # Parsed PDF files to produce common JSON files.
├── extras              # Extra notebooks with additional processes or figures.
├── notebooks           # Notebooks with preprocessing and analyses.
│   ├── models          # LDA models we have used.
│   └── wordlists       # Stopwords and protected words lists.
├── utils               # Helper utilities.
└── README.md


Preliminary figures and visualizations

Figure 1. Documents by main type per decade.


Figure 2. Word cloud of the most mentioned philosophers in the corpus.


Figure 3. Word cloud of the most frequent keywords in the corpus according to article metadata.


Figure 4. Word counts by year.


Note: This suggests that article length has not changed significantly since the journal's foundation in 1951, which runs against a common intuition that philosophy is moving towards shorter articles.
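
A minimal sketch of how per-year word counts like those in Figure 4 could be computed. It is not the code used in this repository: the field names `year` and `text` are assumptions about the parsed article records, the sample records are made up, and words are counted by splitting on whitespace.

```python
from collections import defaultdict
from statistics import mean


def mean_word_count_by_year(articles):
    """Average the whitespace-token count of articles, grouped by year."""
    counts = defaultdict(list)
    for article in articles:
        # "year" and "text" are hypothetical field names.
        counts[article["year"]].append(len(article["text"].split()))
    return {year: mean(values) for year, values in sorted(counts.items())}


# Made-up records standing in for parsed articles.
sample = [
    {"year": 1951, "text": "el ser y el tiempo"},
    {"year": 1951, "text": "la razón"},
    {"year": 2017, "text": "teoría de la acción"},
]
averages = mean_word_count_by_year(sample)
```

Comparing these yearly averages across the whole range is one simple way to check whether article length drifts over time.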

Using a provisional model

The following plots are only proofs of concept. We are using a temporary LDA model with 10 topics to find out which visualizations work best; the model still needs to be fully optimized. The 10 most salient words for each topic in this model are as follows.

| Topic 0 | Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 | Topic 6 | Topic 7 | Topic 8 | Topic 9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| lenguaje | kant | religioso | ser | creencia | ser | político | acción | alma | político |
| interpretación | concienciar | religión | cuerpo | mundo | mundo | formar | moral | ser | derecho |
| teoría | ser | ciudad | formar | ser | hegel | vida | ser | platón | moral |
| experiencia | concepto | filosofía | heidegger | teoría | filosofía | ser | accionar | filosofía | ser |
| wittgenstein | objetar | historia | modo | propiedad | dios | filosofía | agente | conocimiento | justicia |
| filosofía | experiencia | siglo | aristóteles | término | bien | nietzsche | personar | sócrates | bien |
| ser | arte | cultura | ente | contener | vida | foucault | desear | hombre | social |
| problema | husserl | tradición | naturaleza | concepto | razón | social | intención | virtud | sociedad |
| autor | trascendental | ciencia | bien | físico | hombre | crítico | bien | bien | teoría |
| filosófico | modo | obrar | existencia | objeto | pensar | pensamiento | libertar | obrar | razón |
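
To illustrate where a list of top words per topic comes from, here is a toy collapsed Gibbs sampler for LDA written with the standard library only. It is a sketch, not the model or library used in this project: the function names, the hyperparameters `alpha` and `beta`, and the four tiny documents are all assumptions made for the example, and words are ranked by raw count within a topic rather than by salience.

```python
import random
from collections import Counter


def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler; returns one Counter of word counts per topic."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # Start from a random topic assignment for every token.
    assignments = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    doc_topic = [[0] * n_topics for _ in docs]         # tokens per topic, per document
    topic_word = [Counter() for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics                       # total tokens per topic
    for d, doc in enumerate(docs):
        for w, z in zip(doc, assignments[d]):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return topic_word


def top_words(topic_word, n=10):
    """Most frequent words per topic, skipping words whose count dropped to zero."""
    return [[w for w, c in counts.most_common(n) if c > 0] for counts in topic_word]


# Made-up lemmatized documents, for illustration only.
docs = [
    ["kant", "concepto", "experiencia", "trascendental", "kant"],
    ["kant", "husserl", "concepto", "experiencia"],
    ["platón", "alma", "virtud", "sócrates", "alma"],
    ["sócrates", "platón", "conocimiento", "virtud"],
]
topic_word = fit_lda(docs, n_topics=2)
for k, words in enumerate(top_words(topic_word, n=3)):
    print(f"Topic {k}:", " ".join(words))
```

On a real corpus a dedicated topic-modelling library would be the practical choice; the sampler above only shows the mechanics behind such a table.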

These plots still cover only the years 2009 to 2017. We will extend them once we fit the LDA model on the whole corpus.

Figure 6. Proportion of articles by topic
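
One way such proportions could be derived is sketched below. It assumes each article is assigned to the topic with the highest weight in its document-topic distribution; that assignment rule, the function name, and the sample weights are assumptions for illustration, not a description of how the figure was produced.

```python
from collections import Counter


def topic_proportions(doc_topic_weights):
    """Share of documents whose dominant (highest-weight) topic is each topic."""
    dominant = [
        max(range(len(weights)), key=weights.__getitem__)
        for weights in doc_topic_weights
    ]
    counts = Counter(dominant)
    total = len(dominant)
    return {topic: counts[topic] / total for topic in sorted(counts)}


# Made-up document-topic weights for four articles and three topics.
weights = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]
proportions = topic_proportions(weights)
```

An alternative that avoids hard assignments is to average the topic weights over all articles, which credits mixed-topic articles to several topics at once.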


Figure 7. Word counts by topic.
