ropensci / software-review

rOpenSci Software Peer Review.

Submission: phonfieldwork #385

Closed: agricolamz closed this issue 4 years ago

agricolamz commented 4 years ago

Submitting Author: George Moroz (@agricolamz)
Repository: https://github.com/agricolamz/phonfieldwork
Version submitted: 0.0.8
Editor: @melvidoni
Reviewer 1: @jonkeane
Reviewer 2: @nikopartanen
Archive: TBD
Version accepted: 2020-10-20


Package: phonfieldwork
Type: Package
Title: Linguistic Phonetic Fieldwork Tools
Version: 0.0.8
Depends: R (>= 3.5.0)
Imports:
    tuneR,
    phonTools,
    grDevices,
    utils,
    graphics,
    rmarkdown,
    xml2
Authors@R: c(person("George", "Moroz",
             role = c("aut", "cre"),
             email = "agricolamz@gmail.com",
             comment = c(ORCID="0000-0003-1990-6083")))
Description: There are a lot of different typical tasks that have to be solved during phonetic research and experiments. This includes creating a presentation that will contain all stimuli, renaming and concatenating multiple sound files recorded during a session, automatic annotation in 'Praat' TextGrids (this is one of the sound annotation standards provided by 'Praat' software, see Boersma & Weenink 2018 <http://www.fon.hum.uva.nl/praat/>), creating an html table with annotations and spectrograms, and converting multiple formats ('Praat' TextGrid, 'ELAN', 'EXMARaLDA', 'Audacity', subtitles '.srt', and 'FLEx' flextext). All of these tasks can be solved by a mixture of different tools (any programming language has programs for automatic renaming, and Praat contains scripts for concatenating and renaming files, etc.). 'phonfieldwork' provides a functionality that will make it easier to solve those tasks independently of any additional tools. You can also compare the functionality with other packages: 'rPraat' <https://CRAN.R-project.org/package=rPraat>, 'textgRid' <https://CRAN.R-project.org/package=textgRid>.
License: GPL (>= 2)
URL: https://CRAN.R-project.org/package=phonfieldwork, https://agricolamz.github.io/phonfieldwork/
BugReports: https://github.com/agricolamz/phonfieldwork/issues
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.1.0
VignetteBuilder: knitr
Suggests: 
    knitr,
    testthat,
    covr,
    dplyr,
    tidyr,
    DT,
    lingtypology
Language: en-US

Scope

phonfieldwork is a tool that helps researchers create a sound annotation viewer from multiple sound files. I believe that such a viewer helps make phonetic research more reproducible, since it makes sound data widely available. It reduces the time that annotators spend on annotation in Praat, ELAN and EXMARaLDA, so it partially replaces their functionality.

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

JOSS Options
- [ ] The package has an **obvious research application** according to [JOSS's definition](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements).
- [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:
- (*Do not submit your package separately to JOSS*)
MEE Options
- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html)).
- (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*)
- (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*)
- (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

annakrystalli commented 4 years ago

Hello @agricolamz and many thanks for your submission.

We are discussing whether the package is in scope and need a bit more information.

agricolamz commented 4 years ago

Hello @annakrystalli!

thank you for your response.

other packages

In the README file I've mentioned rPraat and textgRid. The first one provides access to the functionality of one particular program for phonetic/phonological research, Praat, and its annotation system (TextGrid). The second one is limited to operations with Praat's annotation system (TextGrid).

My package takes another perspective. Its main goal is to provide different tools for phonetic/phonological researchers. To do that, I provide several functions that are not connected to Praat but are really important for the process of phonetic/phonological research:

Summing up: my package is focused on making the research process straightforward, while rPraat and textgRid are focused on replicating the functionality of Praat in R.

All cutting and merging of sounds is based on tuneR, but I think there are more packages that have the same functionality.

It is also worth mentioning that my package is not the only one that creates spectrograms; Sueur (2018) lists other options: signal, tuneR, seewave, phonTools, monitoR, warbleR and soundgen.

A similar idea about presenting gathered data (without sound) can be found in the LingView package.

ethic policy

Honestly, I hadn't thought about it. That doesn't mean that I don't care; it does mean that I need help on this topic, and whatever path this submission takes, I will appreciate any help and ideas about it. I will try to list here what I know about the topic.

Simple linguistic research or documentation looks like this: a researcher records some audio/video files of speech events and then extracts some linguistic features from them. Sometimes it is possible to conduct research in big cities with a mostly literate population; then consent is not a problem. But there is also a huge amount of work performed in the "field", where researchers go to small settlements with a mostly rural population. In most cases it is possible to receive consent, but it is hard to explain what it means.

In some countries there is an ethics commission that decides whether it is possible to conduct a study, but in some countries there is no policy on this question at all (e.g. in my own country, Russia).

Since my package makes it easier to publish multiple speech samples, that could be a problem, since people could be identified (especially if they are from a small community). What can I do? I can insert notes in my documentation and in the create_viewer() function about what kind of consent should be gathered in order to satisfy research standards. That is easy. I would be happy if somebody helped me, but I think I can gather this from other linguists.

I think there are several things a participant should agree to:

1) conducting scientific research on the gathered data; 2) publishing the gathered data.

If a participant agrees only to the first, it would be useful to put an access restriction (e.g. password protection) on the generated html viewer. This would limit users to academic circles. Which rmarkdown password protection strategy is better for this purpose? I have the encryptedRmd package in mind, but for now it doesn't work with my viewer result (I opened an issue here; maybe I will need to invent a workaround).

Another thing that could be done is modulating the speech signals in order to make speaker identification harder.

I hope I have explained my position and my ideas, and I hope that the rOpenSci community can help elaborate on this problem.

elinw commented 4 years ago

I think this is an interesting question.

Essentially this is a tool. Could the tool be used by someone who has not adequately protected their human participants from being identified or who has not obtained voluntary informed consent? Yes, however I don't think there is anything that can reasonably be done about that by the tool maker because that would really require (from my understanding) changes to the data collection process. Complicating this is the issue that while US-based researchers at federally funded institutions are required to comply with the Common Rule as well as the Belmont Report principles it was derived from, researchers in other places are subject to their local regulations. And non-researchers have no such requirements at all.

The Common Rule and the Belmont Report do not create bright lines, they talk about balancing competing principles: beneficence, justice and respect for persons are sometimes not in agreement. The harm must be weighed against the benefit. In the case of this kind of linguistic data, as I understand it, the potential harm is very much going to be context-dependent. First there is the issue of participants being identifiable, but, second, there could also be larger political and social issues that we have not thought about (for example when a language or culture is being suppressed). This is why respect for persons requires that all of the potential risks be explained to potential participants before data collection.

Overall, I do think that having some kind of notice about the need to obtain voluntary informed consent will at least foreground the issue for researchers. There could be a link to a larger document that describes the range of possible issues in more depth.

agricolamz commented 4 years ago

I totally agree, but it means that the same kind of note should be put on every package that works with sounds, pictures and video. And for now our concern is only about humans, but we could think about it on a greater scale: imagine that I recorded and published audio of a rhinoceros' mating signal, making it easier for poachers to lure those endangered creatures... The note would also need an antipiracy clause, in order to prevent publishing music pieces. Even further: rmarkdown would need a note reminding users that hate speech is inappropriate, probably including some rules from the most popular social media.

annakrystalli commented 4 years ago

Thank you for your feedback @elinw. And thank you for your patience and response @agricolamz. As we encounter new applications involving human subject data, it might take a little longer and input from domain experts like @elinw might be required to help ensure our policy is fit for purpose.

So I'll summarise where I think we are at the moment:

Overall I think some more detailed documentation regarding data protection for privacy and handling data according to informed consent will be enough to start the review process. In particular, drawing attention to what @agricolamz highlighted:

I think there are several things a participant should agree to:

  • conducting scientific research on the gathered data;
  • publishing the gathered data.

I think you also made some good suggestions regarding encryption. Other, simpler options could be warnings, asking for confirmation before performing an action (e.g. publishing data), adding particular files to .gitignore, or even scanning data for potentially sensitive information (e.g. names).
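For example, asking for confirmation could be as simple as this sketch (confirm_publish() is a hypothetical helper for interactive use, not an existing function in the package):

confirm_publish <- function() {
  ans <- readline("This will publish participant audio. Has consent for publication been obtained? [y/N] ")
  tolower(trimws(ans)) == "y"
}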

A full review seems a better way to assess which strategy would be most useful and where in a typical workflow. So I suggest we ask reviewers to consider these points during review?

So, once some more documentation is added to the README on:

I am happy to pass it on to the handling editor.

agricolamz commented 4 years ago

@annakrystalli, @elinw,

I've added a message to the .html-creating function that links to the vignette I created in order to emphasize ethical issues. It would be nice if you could take a look.

Since my package is actually a bunch of small functions, I don't know how to make a small demo of it, but there is a link to the whole tutorial.

I also added a small summary and links to other packages to the README.

elinw commented 4 years ago

Yes I think that the vignette is very good.

annakrystalli commented 4 years ago

Thanks @agricolamz.

@melvidoni will be your handling editor. It would be good to point reviewers to the above discussion to consider while they are reviewing.

melvidoni commented 4 years ago

Editor checks:


Editor comments

Hello @agricolamz thanks for submitting! I'm Dr Melina Vidoni, and I'll be your handling editor. Here are some basic checks for your package:

> spelling::spell_check_package()
DESCRIPTION does not contain 'Language' field. Defaulting to 'en-US'

And the output of goodpractice::gp():

── GP phonfieldwork ───────────────────────────────────────

It is good practice to

  ✖ write unit tests for all functions,
    and all package code in general. 0% of code
    lines are covered by test cases.

    R/add_leading_symbols.R:17:NA
    R/add_leading_symbols.R:18:NA
    R/annotate_textgrid.R:34:NA
    R/annotate_textgrid.R:35:NA
    R/annotate_textgrid.R:37:NA
    ... and 1105 more lines

  ✖ use '<-' for assignment instead of
    '='. '<-' is the standard, and R users and
    developers are used it and it is easier to read
    your code for them if you use '<-'.

    R\draw_spectrogram.R:73:5
    R\draw_spectrogram.R:74:16
    R\draw_spectrogram.R:75:31
    R\draw_spectrogram.R:76:32
    R\draw_spectrogram.R:78:11
    ... and 27 more lines

  ✖ avoid long code lines, it is bad for
    readability. Also, many people prefer editor
    windows that are about 80 characters wide. Try
    make your lines shorter than 80 characters

    R\add_leading_symbols.R:3:1
    R\annotate_textgrid.R:3:1
    R\annotate_textgrid.R:9:1
    R\annotate_textgrid.R:15:1
    R\annotate_textgrid.R:19:1
    ... and 131 more lines

  ✖ avoid sapply(), it is not type safe.
    It might return a vector, or a list, depending
    on the input data. Consider using vapply()
    instead.

    R\draw_spectrogram.R:93:11

  ✖ avoid the library() and require()
    functions, they change the global search path.
    If you need to use other packages, import them.
    If you need to load them explicitly, then
    consider loadNamespace() instead, or as a last
    resort, declare them as 'Depends' dependencies.

    R\create_viewer.R:113:1

  ✖ avoid 1:length(...), 1:nrow(...),
    1:ncol(...), 1:NROW(...) and 1:NCOL(...)
    expressions. They are error prone and result 1:0
    if the expression on the right hand side is
    zero. Use seq_len() or seq_along() instead.

    R\add_leading_symbols.R:18:40
    R\create_subannotation.R:46:10
    R\create_subannotation.R:54:23
    R\df_to_tier.R:52:31
    R\df_to_tier.R:61:31
    ... and 7 more lines

  ✖ fix this R CMD check ERROR: Running
    examples in 'phonfieldwork-Ex.R' failed The
    error most likely occurred in: > ### Name:
    srt_to_df > ### Title: Subtitles .srt file to
    dataframe > ### Aliases: srt_to_df > > ### **
    Examples > > srt_to_df(system.file("extdata",
    "test.srt", package = "phonfieldwork")) Error in
    strsplit(result[, "times"], " --> ") :
    non-character argument Calls: srt_to_df -> cbind
    -> do.call -> strsplit Execution halted
  ✖ fix this R CMD check WARNING: LaTeX
    errors when creating PDF version. This typically
    indicates Rd problems.
  ✖ fix this R CMD check ERROR: Re-running
    with no redirection of stdout/stderr. Hmm ...
    looks like a package Error in texi2dvi(file =
    file, pdf = TRUE, clean = clean, quiet = quiet,
    : pdflatex is not available Error in
    texi2dvi(file = file, pdf = TRUE, clean = clean,
    quiet = quiet, : pdflatex is not available Error
    in running tools::texi2pdf() You may want to
    clean up by 'rm -Rf
    C:/Users/e95207/AppData/Local/Temp/RtmpCIt1xR/Rd2pdfcf4c23af23e0'

Can you please fix these and advise when it is done? I will find reviewers after that.


Reviewers: @jonkeane and @nikopartanen
Due date: Friday, August 28th

agricolamz commented 4 years ago

Dear @melvidoni,

I have added tests, so coverage is now 82%.

I decided to keep lines longer than 80 characters when they contain links (here is an example).

There is also one false alarm about the library() function: this command is inside a character string, so it is not an actual call.

melvidoni commented 4 years ago

Hello @agricolamz, thanks for the revision. It is good now.

The first reviewer will be @jonkeane. I am searching for a second reviewer now. The review deadline will be updated once I have both reviewers confirmed.

agricolamz commented 4 years ago

Wow, a sign language specialist! @jonkeane, I'm going to extend my package to work with video for sign languages one day.

jonkeane commented 4 years ago

Hello 👋 I've already started (a very tiny bit) looking around in phonfieldwork, but I am excited to review it! I haven't done sign language research (or really linguistics in general) in a few years now, but when I saw this package in the issues it piqued my curiosity.

If you are thinking about video processing, you might want to check out signglossR, which might already have some of that functionality. I'm sure there's space for two packages, and the technical needs for sign vs. spoken language phonetics/phonology are different enough that it might make sense to keep them separate as well.

Anyway, I'm looking forward to what you've done already 😃

agricolamz commented 4 years ago

Hello @jonkeane! I knew about this package, but it has developed since the last time I saw it!

@melvidoni, may I propose my own reviewers? I know it is weird since I know them personally, but in case it is not weird:

melvidoni commented 4 years ago

Hello @agricolamz, thanks for the suggestions. I have contacted another potential reviewer; I'll wait for their response, and if it is negative, I'll contact your suggested reviewers. Thank you!

melvidoni commented 4 years ago

Hello @agricolamz. I am still searching for reviewers. I will provide a review deadline once the second reviewer is found.

melvidoni commented 4 years ago

Thank you @jonkeane and @nikopartanen for agreeing to review the package.

The review due date is Friday, August 28th.

jonkeane commented 4 years ago

Package Review

Documentation

The package includes all the following forms of documentation:

For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • [ ] A short summary describing the high-level functionality of the software
  • [ ] Authors: A list of authors with their affiliations
  • [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
  • [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

Final approval (post-review)

Estimated hours spent reviewing: 10-12


Review Comments

Overall this was really interesting to review, and I think {phonfieldwork} shows a lot of promise (and already goes a long way!) toward making some of the more repetitive tasks involved with phonetic and phonological work much easier and more reliable. I threw a few ELAN files that I had from my dissertation research at it and was impressed that it handled them easily. I even managed to read in a folder of ELAN files that ultimately had around 166k annotations in ~30 seconds with eaf_to_df() (and that was over network storage, so some of that time is shuffling the files from a server!).

I have a few larger/more sweeping comments and ideas for changes that are more architectural or broad, and then at the bottom I went through the package function by function with specific comments and suggestions about the code. Small note: I did this reviewing in two batches somewhat far apart, and there were a few pushes to the GitHub repo between the two, so if the line numbers don't match up or make sense, let me know and I'll look at them again to make sure I have the most up-to-date ones.

Large(r) scale comments

Think about making classes+methods for some of the analysis types that get read in

There are a number of places where you detect whether an object is a TextGrid, typically by checking that the second element has "TextGrid" in it (examples: extract_intervals(), tier_to_df(), create_subannotation(), annotate_textgrid(), df_to_tier(), textgrid_to_df(), set_textgrid_names()). I think it might be beneficial to make a class for TextGrids (S3 is probably sufficient, but there are other class systems in R as well); then you can use inherits(object, "TextGrid") instead of grepl("TextGrid", textgrid[2]). Additionally, since that check appears in a few places, you might also want to make a helper function that returns the TextGrid object if it already is one, or reads it in if it's a path.

You could even make an object type for each of the analysis objects that you read in, and if you did, you could also consider making as.data.frame() or write() methods for each of them so that their interfaces work similarly to other areas of R.

This isn't a requirement, and what you have here shows that you don't need it to get the functionality you want, but I think it would clean up some of the duplicated code in the package.
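A minimal sketch of what such a helper could look like (as_textgrid() is a hypothetical name, not something in the package):

as_textgrid <- function(x) {
  if (inherits(x, "TextGrid")) {
    return(x)                          # already tagged, nothing to do
  }
  if (is.character(x) && length(x) == 1 && file.exists(x)) {
    x <- readLines(x)                  # a path was given: read the file
  }
  if (!grepl("TextGrid", x[2])) {
    stop("input does not look like a TextGrid")
  }
  class(x) <- c("TextGrid", class(x))  # tag with an S3 class
  x
}

Downstream functions could then start with x <- as_textgrid(x) and rely on inherits(x, "TextGrid") everywhere else.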

Function restructuring

In a few places you have a pattern where one function creates some output (like an .Rmd file) and then renders it. I wonder if it would be better to split these functions apart, so that you have one function to make the source (.Rmd, .csv, etc.) and another that lightly wraps + renders + cleans up the source. Something like:

make_special_rmd <- function(...) {
  content <- do_things(...)                # build the .Rmd source text
  file_path <- tempfile(fileext = ".Rmd")
  writeLines(content, file_path)

  return(file_path)
}

render_special_rmd <- function(...) {
  file <- make_special_rmd(...)
  on.exit(unlink(file))                    # clean up the source afterwards
  return(rmarkdown::render(file))
}

This would give you the best of both worlds: being able to generate the source with one function call, and being able to render directly with one function call, without needing an argument (render = TRUE) that changes lots of little things in the function and returns very different results (source files as opposed to rendered ones). These smaller functions would also be much easier to test and to develop iteratively as you build on them.

Documentation improvements

Some of the help documents could have larger descriptions describing what they do, more details about how the arguments interact, what the function might be used for, etc.

It also might be nice to have some organization on the pkgdown website. Breaking help sections into topics would add a lot to discovery and seeing how the different components of the package fit together.

String manipulation and {glue}

There's a decent amount of string manipulation that I think might benefit from the {glue} package. I know adding a dependency might not be worth it, but I bet using {glue} would make some of the string manipulations easier to read and work with.
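For instance (a hypothetical illustration, not code from the package):

library(glue)
stimulus <- "tip"
i <- 1
paste0("sounds/", i, "_", stimulus, ".wav")  # base R concatenation
glue("sounds/{i}_{stimulus}.wav")            # same string, arguably easier to read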

A few general style suggestions

In a few places you use right assignment with lapply() (e.g. lapply(...) -> variable). This seems a bit out of style with the left-assign-only convention in the rest of the package, as well as with most style guides that I know of.
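That is (illustrative only):

lapply(1:3, sqrt) -> result   # right assignment: works, but unusual
result <- lapply(1:3, sqrt)   # conventional left assignment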

Though the package satisfies the linters specified in {goodpractice}, there are a number of linting rules from {lintr} that aren't followed consistently. This isn't a huge deal, but fixing them (possibly with {styler}) would make the code even more readable.

Comments on specific functions/files

add_leading_symbols() — Does this function need to be exported? I'm not totally sure it does, but maybe for use in markdown documents?

audacity_to_df() — You could replace:

source <- unlist(strsplit(normalizePath(file_name), "/"))
source[length(source)]

with basename(file_name)

concatenate_soundfiles()

create_glossed_document()

create_image_look_up()

create_subannotation()

create_viewer()

draw_sound()

eaf_to_df()

exb_to_df()

extract_intervals()

get_sound_duration()

get_textgrid_names()

rename_soundfiles()

srt_to_df()

textgrid_to_df()

tier_to_df()

The title() function in flextext_to_df.R seems to be used only inside draw_sound(); I wonder if it should be moved to that file, or to a separate utils.R file or something.

Tests

More minor comments

Formatting

I think you need an additional blank line above this list for it to render as a list: https://github.com/agricolamz/phonfieldwork/blob/master/vignettes/phonfieldwork.Rmd#L97-L98

The tutorial link in the readme links to the pkgdown website, which is circular when it is used as the index page for pkgdown.

Installation

I needed to install pandoc-citeproc before R CMD check would run. Adding this to DESCRIPTION as a system dependency and a note in the readme would be nice!

agricolamz commented 4 years ago

Thank you, @jonkeane!

melvidoni commented 4 years ago

Thanks @jonkeane for the review! @agricolamz you can start addressing it, or you can wait for @nikopartanen to finish their review. There are still 4 days left.

nikopartanen commented 4 years ago

Package Review

Documentation

The package includes all the following forms of documentation:

Functionality

Final approval (post-review)

Estimated hours spent reviewing: 10


Review Comments

phonfieldwork is a very interesting and useful R package, and a very welcome addition to the tools we have available for working with speech data. I also did this review over a few weeks in several sessions; please let me know if some of my comments relate to something already outdated. I installed the package again from GitHub before submitting my review and tried to check that my comments are still valid.

As I understand it, the package has several larger areas of use, which form the pipeline also described in the documentation: 1) creating stimuli sets for data collection, 2) organising the data collected with these stimuli, 3) processing and further annotating the annotation files, 4) reading the resulting datasets into R, and 5) visualizing the data.

The package is designed so that the different parts of the pipeline work well together. Many of my recommendations relate to ways the package could be made more useful in slightly different scenarios, where the data is collected differently, or where the user already has some materials in another format and wants to take advantage of the package's functionality later in the pipeline.

Creating stimuli and organizing the files

This functionality is very useful and well thought out. There aren't that many tools around for this kind of task. The package documentation gives a good explanation of the intended workflow, with notes about recording practices.

I recorded some test audio with Audacity following the documentation, and ended up with files called 1.wav, 2.wav and 3.wav, containing words tip, tap and top. I used these in most of my testing within this part and visualizations.

One problem I encountered with the rename_soundfiles function is that there doesn't seem to be any way to keep the original recording order intact. I end up with files tap.wav, tip.wav and top.wav. However, even when I have the backup, the information that 1.wav is now called tip.wav is not stored anywhere. It could be useful to have an argument that would rename the files but add a number following the original order, i.e. 01_tip.wav, 02_tap.wav and 03_top.wav. Or maybe the function could write a log file into the backup directory that lists how the renaming was done. I'm just thinking about a scenario where we realize much later that something has gone wrong, and for whatever reason we need to go back to the original recordings. As far as I understand, even suffix = paste0("_", 1:3) would not keep the original stimuli order? I think one could often want to record the same word under a somewhat different stimulus, and maybe for analysis it would be clearer to keep the original order.
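For example, a sketch of the naming scheme I have in mind (plain R, not package code):

stimuli <- c("tip", "tap", "top")
sprintf("%02d_%s.wav", seq_along(stimuli), stimuli)
# "01_tip.wav" "02_tap.wav" "03_top.wav"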

I understand from the documentation that the user creates the stimuli, shows them to a research participant, records the stimuli as individual clips, and then ends up with a list of recordings that is the length of the stimuli list, having thereby already removed accidental double takes. This is fine, and in many cases a really well functioning workflow, and the package does provide good help in doing this.

It can be useful to take a look at SpeechRecorder by BAS, which provides similar functionality. The advantage of this software is that one can record directly within a template, and use a microphone through an audio interface connected to a laptop. The main problem I see in using phonfieldwork's stimuli list is that it isn't entirely clear from the documentation how the researcher should handle repeated or missed elicitations, and when the number of elicitations is in the hundreds, managing this can get really complicated. This is of course a problem that documentation can solve, and there are already good recommendations related to the recordings.

Maybe it would be useful to add a clear sequential id number to the stimuli list, so the user can easily listen, for example, to stimulus 75 and check that it matches the 75th file in their recording directory. I understand the page number in the stimuli file probably already fills this need, and maybe it is enough. Additionally, one could add to rename_soundfiles a new argument missing, where the user could basically say "for whatever reason we don't have recordings for stimuli 21, 34 and 94, rename things accordingly". Otherwise the user gets the error Number of files is smaller then number of stimuli.

Processing data

Annotating TextGrid files works very well; the functions are well thought out and form a clear pipeline. The examples in the documentation are also clear.

It seems to me that there is no way to create a TextGrid file besides the concatenate_soundfiles function. I think, however, that in many situations we have annotations in one of the formats the package can read, and one would like to work further with the package's spectrogram visualization methods. When we want to edit the annotations at this level of exactness, having them in Praat is very beneficial. So it could be useful if there was a function create_textgrid that would create an empty TextGrid file that the user could populate with annotate_textgrid and create_subannotation. Here I'm thinking of a situation where the user would like to jump into phonfieldwork's pipeline more in the middle, i.e. writing intermediate Praat files from segments read from another corpus.
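A rough sketch of what such a create_textgrid() could write, producing a minimal one-tier TextGrid (the function name is my suggestion, not an existing one, and the exact header should be checked against Praat's long TextGrid format):

create_textgrid <- function(file_name, duration, tier_name = "labels") {
  writeLines(c(
    'File type = "ooTextFile"',
    'Object class = "TextGrid"',
    "",
    "xmin = 0",
    paste0("xmax = ", duration),
    "tiers? <exists>",
    "size = 1",
    "item []:",
    "    item [1]:",
    '        class = "IntervalTier"',
    paste0('        name = "', tier_name, '"'),
    "        xmin = 0",
    paste0("        xmax = ", duration),
    "        intervals: size = 1",
    "        intervals [1]:",
    "            xmin = 0",
    paste0("            xmax = ", duration),
    '            text = ""'
  ), file_name)
}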

Separating the folder-processing side of concatenate_soundfiles from working with individual files could be good. Maybe the user could also provide a list of audio files to be concatenated? It is possible this wouldn't fit the logic of the pipeline the package is designed for, but I leave the idea here just in case.

Reading data into R

Reading datasets into R works really well. I tested extensively with both ELAN and EXMARaLDA corpora; the resulting data frames are well structured, and the package works without problems across different ELAN files. I must emphasize that these functions worked better with .eaf and .exb files than anything I have seen before, so I'm very happy with them.

While testing I used EXMARaLDA files from the Nganasan Spoken Language Corpus and the Selkup Language Corpus. I also used an individual FlexText file I found here, which belongs to the corpus The Documentation of Eibela: An archive of Eibela language materials from the Bosavi region. For testing ELAN I used various files I had on my laptop; as this is a very common format, it was easier to find different examples. Using ELAN files was also entirely free from adventures.

With some EXMARaLDA files I did encounter problems; this should reproduce my error:

download.file(url = "https://corpora.uni-hamburg.de/hzsk/de/islandora/object/transcript:slc-1.0.0_BEP_1973_Itja1_flk/datastream/EXB/BEP_1973_Itja1_flk.exb", destfile = "BEP_1973_Itja1_flk.exb")

phonfieldwork::exb_to_df("BEP_1973_Itja1_flk.exb")

Which results in:

Error in data.frame(tier = i, id = seq_along(content), content = content,  : 
  arguments imply differing number of rows: 1, 0

Maybe there could be some generic error message that prints the path of the file, tells that the file cannot be parsed, and then shows the underlying error message. In a typical corpus-creation workflow, the files we fail to parse are usually a very good indicator of the files that need to be checked. That some files fail to parse is just typical, and not necessarily a problem in itself; especially if the file is broken, it is better to fail than to read it wrong. I assume some corpora contain very unusual file structures, possibly related to old conversions from format to format, so it is impossible to predict everything inside those files. But there may be patterns common enough that checking what is going on can improve the function.
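Something along these lines, as a sketch (safe_exb_to_df() is a hypothetical wrapper, not part of the package):

safe_exb_to_df <- function(file_name) {
  tryCatch(
    phonfieldwork::exb_to_df(file_name),
    error = function(e) {
      # report which file failed, then show the underlying error
      stop("Cannot parse file: ", file_name, "\n", conditionMessage(e), call. = FALSE)
    }
  )
}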

I also encountered one FlexText file that I couldn't parse, which is here. This file doesn't seem to contain word-level annotations, so that it doesn't work is not surprising, of course. If this kind of file is more common (I have no idea; FlexText files are surprisingly hard to find!), it could help to have a specific error or warning message for it.

I think most of the time, after the user reads these files into R, they will need to filter them a bit and probably use pivot_wider to get the content they want to work with into the right columns. This isn't strictly speaking a matter for this package, but it could be nice to have an example of such a process in a vignette, especially if the goal is to make the package beginner friendly. Something like:


download.file("https://corpora.uni-hamburg.de/hzsk/de/islandora/object/transcript:nslc-0.2_ChKD_72_ManyTents_flks/datastream/EXB/ChKD_72_ManyTents_flks.exb", destfile = "ChKD_72_ManyTents_flks.exb")

library(dplyr)
library(tidyr)

phonfieldwork::exb_to_df(file_name = "ChKD_72_ManyTents_flks.exb") %>%
    filter(tier_category %in% c("tx", "ps", "mp", "mc", "mb", "ge", "gr")) %>%
    select(-tier, -id, -tier_name, -tier_type) %>%
    pivot_wider(names_from = tier_category, values_from = c(content))

This way we end up with a nice data frame that has tokens, POS tags and morphological analysis organized so that we have one observation (a token here; it could also be the utterance) per row. It is impossible, of course, to predict which tiers the user wants to work with, which is why simple examples that point toward further possibilities could be useful. Or just mention that pivot_wider is something to look into for further processing and analysis of the resulting data frame.

Note: These corpora are distributed under a CC BY-NC-SA license, so using other examples in the package vignette is certainly advisable, as the license is not compatible with that of the package. I assume using these files in my review and testing is unproblematic. I originally wanted to include a screenshot of the resulting data frame, but I excluded it in order not to show actual corpus content.

I agree with the other reviewer that reading individual files and reading all the files in a directory should somehow be separated. Especially since, when the user reads in whole directories, there should be good control over things like whether files are found recursively, etc.

Visualizations

These work well, and as described. I'm also personally extremely happy to be able to create good-looking spectrograms in R. These are just small suggestions that could possibly improve usability.

Adding annotations from a data frame to a visualization made with draw_sound works really beautifully. It took me quite a lot of time to understand that it is possible to pass a data frame to the annotations argument. The help text currently says a source for annotation files (e. g. TextGrid); a mention of a data frame should also be there. This is currently too easy to miss, so it should appear prominently in the documentation as well.

As a technical note, using a tibble instead of a data.frame seems to give an error here. I think it is important that the package not distinguish between the data.frame and tbl_df classes.
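One simple way to handle this would be to normalize the input at the top of the function (a sketch; the helper name is hypothetical):

normalize_annotation <- function(annotation) {
  if (inherits(annotation, "data.frame")) {
    # as.data.frame() strips the tbl_df class, so tibbles and plain
    # data.frames behave identically downstream (e.g. df[, 1] drops to a vector)
    annotation <- as.data.frame(annotation)
  }
  annotation
}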

Bonus points for Raven style annotations, that's a great addition.

Further comments

I think create_glossed_document is a very useful function, but I'm not entirely sure this package is ultimately the best place for it. It's a bit aside from the current pipeline the package provides, although it is a really important tool. One way to connect it better would be to let it take, instead of a FlexText file, a data frame with specific columns. Then it would link more clearly to the other capabilities of the package: reading the corpus, manipulating and analyzing it in R, and then converting specific parts to glossed documents that can be used elsewhere.

When it comes to HTML export, it could be worth checking out the Leipzig.js JavaScript library, as glosses are much prettier when aligned correctly. However, Word export especially is something I've heard wished for a lot, so it's nice to see it exists and works well here.

agricolamz commented 4 years ago

Thank you, @nikopartanen!

nikopartanen commented 4 years ago

You are more than welcome, @agricolamz! It was a pleasure to dive deeper into this package, and I probably haven't even got that far with it yet, but today was the deadline and I wanted to respect it. Have a good weekend!

melvidoni commented 4 years ago

Thanks @nikopartanen for the review! @agricolamz Please let us know when the changes are up. Write a comment explaining them, so the reviewers can review the changes.

melvidoni commented 4 years ago

Hello @agricolamz, are there any updates on your reply to the reviewers?

agricolamz commented 4 years ago

Only small updates, but I will make more by the end of this week.

agricolamz commented 4 years ago

Dear @jonkeane and @nikopartanen, I've made a lot of changes. Thank you again for your comments.

I'm mostly finished. The only big changes left are:

I will finish it during this week.

Here are my answers to some of your comments (for most of the others, I simply changed everything according to them).

For @jonkeane

pandoc-citeproc

I needed to install pandoc-citeproc in order to get started, this should probably be listed in the SystemRequirements field of the DESCRIPTION + it would be nice to include it in the readme.

It looks like pandoc-citeproc is a dependency of rmarkdown and not of phonfieldwork, so I'm not sure it should be in the phonfieldwork DESCRIPTION, but I added it to the Introduction section.

Think about making classes

I will keep it in mind, but for the time being I created a function read_textgrid() that is more general than the previous one, since it deals with encodings (thanks to @artemklevtsov for helping me with it).

Function restructuring

I need more time to think about it.

Documentation improvements

I added sections to the pkgdown reference page.

String manipulation and {glue}

I don't want additional dependencies, but maybe I will change my opinion in the future.

srt_to_df()

On lines 44, 55, and 56, I see use of integers (e.g. c(3:4)) for indexing. This is probably ok since you define the result that is being indexed, but it might be nice for readability to use names here.

I'm not sure what kind of names could be used here, since the common

Tests

If your test-*.R files are named the same (after the test- bit) as your *.R source files, you can run them individually with `cmd+shift+i` (most of yours are, but there are a few exceptions)

For some reason it doesn't work for me (I checked with both _ and - signs in function names). So I found and fixed only one mismatch.

Comments on specific functions/files

create_glossed_document()

You might consider using a flextext class here (if you're using S3, you can make it so that they are also data.frames; if you're using S4 or R6 or reference classes (all of which are probably overkill here), you could make this class inherit from data.frame). This way you could test that the object is of that class and write one method that checks that it has the columns you have specified there.

I'm not so big a fan of classes, but I will keep it in mind.

create_viewer()

Would it be possible to create a report where some sounds have images, but others don't? (Or even one where it's just sounds and no pictures, or just pictures and no sounds?)

Not in the current version, but I will consider making it possible in the future.

Is it possible to have mp3 audio files in the viewer?

I'm not sure if this is about Safari, or something about the images, but the images never seemed to show up (or would show up only on the map in the example) when clicking on the eye. I didn't dig too much to see what was going on here, but you might want to try it out in a few different browsers/settings.

It is a known problem. I will fix it soon.

textgrid_to_df()

I had at least one textgrid with nulls in it, such that I had to add skipNul = TRUE to the readLines() call on line 33 to make this work (the warning I got was line 50 appears to contain an embedded nul), though that will result in all of the lines being "". I wonder if you should pass all of the options (maybe via ... from textgrid_to_df()) through to readLines.

Could I take a look at it? I've implemented some encoding detection since this version; maybe it works now.

You might also want to add a check that the lines in tg aren't all "" (like with the above error) and error proactively there. Otherwise, the error happens later on line 53, where split_text doesn't exist, which makes it harder to see that this was an encoding/input problem.

I'm not sure about that. Since I think I have solved the possible encoding problems, this seems less important now. But you are right, I need to consider all the weird possibilities.

For @nikopartanen

Creating stimuli and organizing the files

I added autonumber and logging arguments to the rename_soundfiles() function. Both can be switched off. The autonumber argument adds a prefix with a number (and appropriate zero padding). The logging argument creates a logging.csv file with the renaming correspondences.
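The logging idea amounts to something like this sketch (illustrative values, not the actual implementation):

log_df <- data.frame(
  old = c("1.wav", "2.wav", "3.wav"),
  new = c("01_tip.wav", "02_tap.wav", "03_top.wav")
)
write.csv(log_df, file.path("backup", "logging.csv"), row.names = FALSE)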

I added SpeechRecorder (for some reason I can't install it on my PC) to the vignette. It would be nice to one day add an R function that creates a recording script for it.

I added the missing argument; this is a really nice thought.

Processing data

I added a function create_empty_textgrid().

Reading data into R

I fixed exb_to_df(), so your example now parses into a table.

I added an error for flextext files that don't have word-level annotations. Handling them properly looks like a bigger piece of work, so I will fix it in the future.

I added a new vignette about manipulating the resulting tables with the tidyverse. Grammar and spelling will be corrected soon.

Visualizations

I changed the documentation a little bit, in order to state explicitly that a data frame can be an annotation source.

As a technical note, using a tibble instead a data.frame seems to give an error here. I think it would be important that the package would not make a difference between data.frame and tbl_df classes.

Thank you for spotting this! Fixed.

Further comments

I think create_glossed_document is a very useful function, but I'm not entirely sure if this package is ultimately the best place for it. It's a bit aside from the current pipeline the package provides, although it is a really important tool. One way to connect it better would be to be able to give it instead of a FlexText file a data frame with specific columns. Then it would link more clearly to other capablities of the package: reading the corpus, manipulating and analyzing it in R, and then converting some specific parts to glossed documents that can be used elsewhere.

That is a really nice thought; I added it as an issue.

About Leipzig.js: it is nice, but it has its own issues that we started discussing long ago.

agricolamz commented 4 years ago

So now, in order to separate the functions for reading individual files from reading all the files in a directory, I created a function read_from_folder <- function(path, FUN){...} which reads every file of a given format from a folder using an existing phonfieldwork function, e. g. read_from_folder("test", textgrid_to_df). What do you think, should I also create separate functions like textgrid_to_df_from_folder(), eaf_to_df_from_folder()?
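For context, a minimal sketch of how such a function could work (the real implementation may differ, e.g. by filtering files by extension):

read_from_folder <- function(path, FUN) {
  files <- list.files(path, full.names = TRUE)
  # apply the chosen reader to every file and stack the results
  do.call(rbind, lapply(files, FUN))
}

# e.g. read_from_folder("test", phonfieldwork::textgrid_to_df)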

melvidoni commented 4 years ago

Hello @jonkeane and @nikopartanen. Please take a look at the submitted revisions, following our guide. Let me know within about 3 weeks at the most whether @agricolamz needs to complete another revision.

jonkeane commented 4 years ago

Hello again,

I’ve reviewed the changes and the responses and I’m satisfied with both.

For the encoding issues, I was using some of the files I found in http://www.ims.uni-stuttgart.de/phonetik/helps/praat-scripting/scripts.zip that gave me some of the encoding issues I mentioned.

melvidoni commented 4 years ago

Thanks @jonkeane. @nikopartanen, did you have time to look at the changes?

nikopartanen commented 4 years ago

Hello! Sorry for the delay! I tested the changes quite thoroughly last week, when I used the package in a research task of my own, and I'm very satisfied with how the package works now and how the ideas for changes were addressed. Thanks for your great work, @agricolamz!

melvidoni commented 4 years ago

Since both reviewers approved... Approved! Thanks @agricolamz for submitting and @jonkeane @nikopartanen for your reviews!

To-dos:

Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them "rev"-type contributors in the Authors@R field (with their consent). More info on this here.

Welcome aboard! We'd love to host a post about your package - either a short introduction to it with an example for a technical audience or a longer post with some narrative about its development or something you learned, and an example of its use for a broader readership. If you are interested, consult the blog guide, and tag @stefaniebutland in your reply. She will get in touch about timing and can answer any questions.

We've put together an online book with our best practices and tips; this chapter starts the third section, which is about guidance for after onboarding. Please tell us what could be improved; the corresponding repo is here.

agricolamz commented 4 years ago

@melvidoni, could you give me admin access? I think I've done the rest (except a blog post).

melvidoni commented 4 years ago

You should have access now @agricolamz!

agricolamz commented 3 years ago

Dear @stefaniebutland, I would like to make a blogpost about phonfieldwork. May I get a publication date?

stefaniebutland commented 3 years ago

Excellent @agricolamz. There are a couple of options for publication dates, depending on how urgent it is for you.

  1. publish Tues Jan 19; submit draft by Tues Jan 12.
  2. publish Tues Dec 22; submit draft by Mon Dec 14, with agreement to have all edits done before Fri Dec 18

Blog post drafts are reviewed by rOpenSci Community Assistant @steffilazerte. Since your lingtypology post we have written https://blogguide.ropensci.org/ with detailed content and technical guidelines.

agricolamz commented 3 years ago

@stefaniebutland, thank you! I have already written my first draft, so I'll take the second option.

agricolamz commented 3 years ago

@stefaniebutland thank you! I have already written the blogpost, so I'd take the second option. Created a pull request.