lmlui / Jorg

MAG Circularization Method
GNU General Public License v3.0
MIRA assembler crash with read IDs >40 characters #11

Open jungbluth opened 1 year ago

jungbluth commented 1 year ago

When MIRA encounters reads whose names are longer than 40 characters, it crashes with the following error:

Fatal error (may be due to problems of the input data or parameters):

3413720 reads were detected with names longer than 40 characters (see output log for more details).

While MIRA and many other programs have no problem with that, some older programs have restrictions concerning the length of the read name. .....

Jorg needs a way to handle this, either by 1) automatically shortening or renaming read IDs, or 2) exiting immediately with a clear message when users start Jorg with FASTQ input whose read IDs exceed 40 characters.
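As a rough illustration of option 2, here is a minimal Python sketch of a pre-flight check that could run before MIRA is invoked. It is not part of Jorg. The function names are hypothetical, and it assumes strict four-line FASTQ records and that the read name is the first whitespace-delimited token of the header line, which may not match exactly how MIRA measures name length.

```python
import sys
from typing import Iterable, Iterator, List, Tuple

MAX_NAME_LEN = 40  # limit quoted in the MIRA error message


def read_names(lines: Iterable[str]) -> Iterator[str]:
    """Yield the read name from each record (assumes four lines per record)."""
    for i, line in enumerate(lines):
        if i % 4 == 0:
            if not line.startswith("@"):
                raise ValueError(f"line {i + 1}: expected a FASTQ header starting with '@'")
            fields = line[1:].split()
            # Assumption: the name is the first token; the rest is a description.
            yield fields[0] if fields else ""


def find_long_names(lines: Iterable[str], limit: int = MAX_NAME_LEN) -> Tuple[int, List[str]]:
    """Count names longer than `limit` and keep up to five examples."""
    count = 0
    examples: List[str] = []
    for name in read_names(lines):
        if len(name) > limit:
            count += 1
            if len(examples) < 5:
                examples.append(name)
    return count, examples


def check_fastq_or_exit(path: str, limit: int = MAX_NAME_LEN) -> None:
    """Exit with a readable message if any read name exceeds `limit`."""
    with open(path) as handle:
        count, examples = find_long_names(handle, limit)
    if count:
        sys.exit(
            f"{count} read names in {path} are longer than {limit} characters "
            f"(for example: {', '.join(examples)}). Rename the reads and try again."
        )
```

Running a check like this up front would turn a crash deep inside the assembly step into an immediate, explained exit.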

tnn111 commented 1 year ago

Long read IDs are a problem in many settings.

Anyone who runs into this can use seqtk rename to change the read names.
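For anyone who would rather script the renaming, the same idea can be sketched in a few lines of Python. This is not seqtk's implementation, only an illustration of replacing every read name with a short prefix plus a running counter. The function names and the `read` prefix are hypothetical, and the sketch assumes strict four-line FASTQ records.

```python
from typing import Iterable, Iterator


def rename_reads(lines: Iterable[str], prefix: str = "read") -> Iterator[str]:
    """Replace each FASTQ header with '@<prefix>_<n>' (assumes four lines per record)."""
    counter = 0
    for i, line in enumerate(lines):
        if i % 4 == 0:
            # Header line: drop the original name and description entirely.
            counter += 1
            yield f"@{prefix}_{counter}\n"
        elif i % 4 == 2:
            # Separator line: drop any repeated name after the '+'.
            yield "+\n"
        else:
            # Sequence and quality lines pass through unchanged.
            yield line


def rename_fastq(in_path: str, out_path: str, prefix: str = "read") -> None:
    """Stream a FASTQ file to a new file with shortened read names."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(rename_reads(src, prefix))
```

The original names are discarded here, so keep the input file if the names need to be traced back later.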

I’d suggest adding a note to the documentation that read IDs cannot exceed 40 characters and leaving it at that.

jungbluth commented 1 year ago

To clarify, this issue was reported by two KBase users. I recommended that they rename their sequence IDs, so no one is currently blocked.

This issue should be addressed at some point to lower the maintenance cost of the KBase Jorg implementation. Alternatively, KBase may need a universal solution for long read IDs. That would probably take longer to implement, so I'm not sure which is best here.