urol-e5 / deep-dive-expression

Analyzing gene expression and gene/ncRNA coexpression in three species of stony coral. This work builds on the completed summary of the ncRNA landscape of these species (deep-dive).

ShortStack unicode error #11

Closed shedurkin closed 1 month ago

shedurkin commented 1 month ago

Links: code, log file with error message

ShortStack ran for a few hours and generated the bulk of its preliminary output files (pic of some shown below), but it errored out when it began miRNA identification, with the error below:

UnicodeEncodeError: 'ascii' codec can't encode character '\u2026' in position 41252: ordinal not in range(128)

image

At first I thought the new genome might contain a problematic Unicode character, since the genome is the only input file that differs from other successful runs, and because the genome file is likely involved in miRNA identification. However, when I searched the genome file for the Unicode character '\u2026' (the ellipsis character …) with grep -P '\x{2026}' Apulchra-genome.fa, I didn't find anything. Some googling suggests that an error involving the ellipsis character may be related to how very long file paths are read, but none of the file paths provided for this run are particularly long. The file paths are also the same length as in code that ran successfully in deep-dive.
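The grep above only covers the genome, so one way to widen the search is to scan every plain-text input for non-ASCII characters. Here is a minimal sketch in Python; `find_non_ascii` is a hypothetical helper written for illustration, not part of ShortStack:

```python
def find_non_ascii(path, limit=10):
    """Return up to `limit` (line, column, character) tuples for non-ASCII characters.

    Lines and columns are 1-based. Bytes that are not valid UTF-8 are read as
    U+FFFD (the replacement character), so they are reported as well.
    """
    hits = []
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line_number, line in enumerate(handle, start=1):
            for column, char in enumerate(line, start=1):
                if ord(char) > 127:
                    hits.append((line_number, column, char))
                    if len(hits) >= limit:
                        return hits
    return hits
```

Running it on each text file passed to ShortStack, not just the genome, would show which input carries the character.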

Has anyone seen an error like this before?

sr320 commented 1 month ago

Possible solution via ChatGPT:

The error message you're encountering is a Python UnicodeEncodeError that states:

UnicodeEncodeError: 'ascii' codec can't encode character '\u2026' in position 41252: ordinal not in range(128)

Breaking Down the Error:

  1. UnicodeEncodeError: Python raises this exception when a string contains a character that the requested codec cannot encode. Here the failure happens while text is being converted to bytes.

  2. 'ascii' codec can't encode character '\u2026': The 'ascii' codec is being used to encode a Unicode string. The character \u2026 is a Unicode representation for the ellipsis character '…'.

  3. in position 41252: The error occurred at character position 41,252 in the string being processed.

  4. ordinal not in range(128): ASCII encoding can only handle characters with ordinal values (code points) between 0 and 127. The ellipsis character has a higher ordinal value, so it can't be encoded using ASCII.
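The four parts of the message can be reproduced in isolation. This is a minimal sketch with a made-up string; `try_ascii` is a hypothetical helper and has nothing to do with ShortStack's own code:

```python
def try_ascii(text):
    """Return the UnicodeEncodeError raised by ASCII-encoding `text`, or None."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as err:
        return err
    return None


sample = "ACGU\u2026"          # illustrative string ending in the ellipsis character
error = try_ascii(sample)
if error is not None:
    # error.start is the 0-based index of the offending character,
    # which is the "position" number shown in the message.
    print(error.start, repr(sample[error.start]))
    print(error)

# The same string encodes without error as UTF-8.
utf8_bytes = sample.encode("utf-8")
```

The `start` attribute of the exception corresponds to the "position 41252" part of the original message, so it can be used to locate the character in whatever string was being encoded.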

What This Means in the Context of ShortStack:

ShortStack is attempting to process text data that includes the ellipsis character '…'. However, somewhere in the code or in the libraries it uses, it's trying to encode this data using ASCII encoding, which doesn't support characters outside the standard ASCII range.

Possible Causes:

How to Fix It:

  1. Use UTF-8 Encoding: Modify the code to use UTF-8 encoding, which supports a wide range of characters, including the ellipsis.

    # Example of specifying UTF-8 encoding in Python
    text = 'truncated name…'
    data = text.encode('utf-8')  # returns bytes; succeeds where 'ascii' fails

  2. Update Locale Settings: Ensure that the system or environment where ShortStack is running is set to use UTF-8 encoding.

    • On Unix-like systems, you can set the LANG and LC_ALL environment variables:

      export LANG=en_US.UTF-8
      export LC_ALL=en_US.UTF-8
  3. Handle Unicode Properly in Code: If you have access to the source code, make sure all string handling properly accounts for Unicode characters.

    • Avoid implicit conversions between Unicode and byte strings.
    • Use Unicode-aware functions and libraries.
  4. Sanitize Input Data: If the special character isn't necessary, you could preprocess the input data to remove or replace characters that aren't ASCII-compatible.

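    # Example of replacing or removing non-ASCII characters:
    # NFKD maps characters such as '…' to ASCII equivalents ('...');
    # anything that is still non-ASCII is then dropped.
    import unicodedata
    text = 'truncated name…'
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

  5. Check Third-Party Libraries: If ShortStack uses external libraries, ensure they are up to date and handle Unicode correctly.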
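Fixes 1 and 2 can be sketched together in plain Python. This is a generic illustration, not ShortStack's code: `current_encodings` and `write_text` are hypothetical helpers, and whether ShortStack opens its files this way is not known from this thread.

```python
import locale
import sys


def current_encodings():
    """Report the encodings Python picked up from the environment (LANG, LC_ALL)."""
    return {
        "locale": locale.getpreferredencoding(False),
        "stdout": getattr(sys.stdout, "encoding", None),
    }


def write_text(path, text):
    """Write `text` as UTF-8 regardless of the locale settings."""
    # Passing encoding explicitly avoids falling back to the locale default,
    # which can be ASCII when LANG / LC_ALL are unset or set to "C".
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
```

If `current_encodings()` reports an ASCII-based encoding inside the environment where the tool runs, the `export` lines in item 2 are the first thing to try.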

Conclusion:

The error occurs because ShortStack is trying to encode a Unicode character (ellipsis '…') using ASCII encoding, which doesn't support it. By adjusting the encoding to UTF-8 and ensuring that all parts of the software handle Unicode properly, you can resolve this error.

Recommendation:

kubu4 commented 1 month ago

Not a solution, but there's a new version of ShortStack which is supposed to be 50% faster. Maybe try running with that and see what happens?

New environment name to activate: ShortStack-4.1.0_env

And, just like we did during Science Hour, be sure to close your environment and restart R before running with this.

sr320 commented 1 month ago

@shedurkin where does this stand?

shedurkin commented 1 month ago

Finally figured out the problem: I had downloaded the cnidarian+mirbase reference fasta incorrectly, so it contained incompatible characters. I fixed the download and had a successful ShortStack run using the new version (rendered code). Not only is the new ShortStack version much faster, it also apparently generates structural visualizations for all annotated miRNAs!

kubu4 commented 1 month ago

> I had downloaded the cnidarian+mirbase reference fasta incorrectly

Can you please add a note here with some deets on how you downloaded it incorrectly, as well as how you ended up downloading it correctly?

Could be useful for future reference. Thanks!

shedurkin commented 1 month ago

Yep, I just used the incorrect GitHub link. I found the cnidarian-mirbase fasta file in the deep-dive repo and needed to download a copy to deep-dive-expression. I originally used the link available when viewing the file through GitHub (https://github.com/urol-e5/deep-dive/blob/main/data/cnidarian-mirbase-mature-v22.1.fasta). That link points to the webpage that displays the file, not to the raw file itself, so the download was the HTML page rather than the fasta. To download the file directly, I needed to use the raw link to the file (click the "Raw" button at the top-right of the file viewer). This redirects you to the actual content of the file.
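For future reference, the blob-to-raw conversion and a quick sanity check on the download can be sketched in Python. Both functions are hypothetical helpers written for this note. The sketch assumes the branch name contains no slashes and that raw content is served from raw.githubusercontent.com under owner/repo/branch/path.

```python
from urllib.parse import urlsplit


def github_blob_to_raw(url):
    """Convert a github.com '/blob/' file-viewer URL to the raw-content URL."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Expected layout: owner / repo / "blob" / branch / path...
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a GitHub blob URL: {url}")
    owner, repo, _, branch, *file_path = segments
    return "https://raw.githubusercontent.com/" + "/".join(
        [owner, repo, branch, *file_path]
    )


def looks_like_fasta(text):
    """Cheap check: first non-empty line is a '>' header and the text is pure ASCII."""
    first = next((line for line in text.splitlines() if line.strip()), "")
    return first.startswith(">") and text.isascii()
```

Running `looks_like_fasta` on the downloaded file's contents before starting ShortStack would have flagged the HTML page, since it does not start with a `>` header line.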