nextstrain / nextclade

Viral genome alignment, mutation calling, clade assignment, quality checks and phylogenetic placement
https://clades.nextstrain.org
MIT License

Add TSV output #159

Closed (garfinjm closed 4 years ago)

garfinjm commented 4 years ago

Hello,

Nextclade is a fantastic tool and has saved me tons of time.

One small request: the "Export to CSV" output seems to use semicolons as the field separator instead of commas. This formatting does weird things if you try to import it into Excel or parse it with cut.

seqName;clade;alignmentStart;alignmentEnd;mutations;totalMutations;deletions;totalGaps;insertions;totalInsertions;missing;totalMissing;totalNonACGTNs;QCStatus;QCFlags;errors
sample1;-;35;29685;C241T,C1059T,C8782T,C18060T,A23403G,A24694T,G28812T;7;23173;1;" 10163T";1;2264-2413,5953-6176,7232-7311,8336-8341,8454-8602,9649-9795,20186-20477,22341-22524,22904-23130,23519-23686,23819-23943;1741;0;Fail;missing data;
sample2;19A;36;29852;C241T,C1059T,A2480G,G11083T,G14350T,G18733T,G25991C,T28144C;8;;0;" 11084T";1;1643-2197,4299-4309,5952-6176,7112-7312,9057-9211,9579-9795,10764-11005,11944-12112,18366-18606,19609-19857,20562-20793,21163-21309,21796-21970,22343-22525,22898-23131,25364-25611,25993-26206,26587-26846,29044-29301;4197;0;Fail;missing data;
sample3;19A;36;29860;C225T,C884T,G1397A,C7979T,G8653T,C11074T,G11083T,A19073G,C21707T,A22486G,C25708T,C26895T,G27915T,T28688C,C28868T,C29077T,G29742T;17;;0;;0;20249-20252,28984-28991;10;1;Pass;;
sample4;20C;38;29859;C241T,C1059T,C3037T,C3773T,C7086T,G12662A,T14191C,C14408T,C16260T,A23403G,G23900C,A24253T,G25563T,C28821A;14;;0;;0;22405-22474;69;0;Pass;;

An "Export to TSV" option where the semicolons are replaced with tabs would be great and increase compatibility with many tools!
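Until such an option exists, the existing export can be converted locally. The following is a minimal sketch using Python's standard csv module, assuming the export is plain semicolon-delimited text with optional double-quoted fields, as in the sample above. The function name semicolon_csv_to_tsv is hypothetical and not part of Nextclade.

```python
import csv
import io


def semicolon_csv_to_tsv(text: str) -> str:
    """Re-serialize semicolon-delimited text as tab-delimited text.

    Hypothetical helper, not part of Nextclade.
    """
    # Parse with ';' as the delimiter; quoted fields are unwrapped by the reader.
    reader = csv.reader(io.StringIO(text), delimiter=";")
    out = io.StringIO()
    # Write with tabs; commas inside a field need no quoting in TSV.
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


# Shortened rows modeled on the export above (only a few columns kept).
sample = (
    "seqName;clade;mutations;totalMutations\n"
    "sample3;19A;C225T,C884T;17\n"
    "sample4;20C;C241T,C1059T;14\n"
)
tsv = semicolon_csv_to_tsv(sample)
print(tsv)
```

Going through a CSV parser rather than a plain find-and-replace of ; with a tab keeps quoted fields intact if a value ever contains a semicolon.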

ivan-aksamentov commented 4 years ago

Hi @garfinjm , thanks for your feedback!

I added TSV export just now in #160

If I remember correctly, our primary justification for choosing ; as the delimiter was that we don't control some of the values (they come from the user), and they may contain tabs. My version of Excel also gets confused by the mixture of ; and , so I am not against trying a tab delimiter as well.

The serialization library we are using, papaparse, should automatically quote any field that contains the delimiter character, so TSV should be safe even if, for example, a sequence name contains a stray tab character. Hopefully, Excel and other tools will understand the quotation marks in this case.
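The quoting behaviour described above can be illustrated with Python's standard csv module, which by default also quotes only the fields that contain the delimiter. This is an analogy under that assumption, not a demonstration of papaparse itself, and the sequence name with an embedded tab is made up for the example.

```python
import csv
import io

rows = [
    ["seqName", "clade"],
    ["sample\twith tab", "19A"],  # hypothetical name containing a stray tab
]

# Serialize as TSV; the default QUOTE_MINIMAL policy quotes only fields
# that contain the delimiter, the quote character, or a line break.
buf = io.StringIO()
csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)
tsv_text = buf.getvalue()
print(repr(tsv_text))

# A quote-aware reader recovers the original two-column rows.
parsed = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
print(parsed == rows)
```

Tools that split on tabs without honouring quotes, such as cut, would still see three columns in that row, so the quoting only helps quote-aware parsers.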

Please try the new development version of the app deployed to: https://nextclade-git-feat-tsv-export.neherlab.vercel.app/ and let us know if it works for you.

I will let @rneher make the final judgement, double-check and merge if okay. If approved, we can release this new version right away.

ivan-aksamentov commented 4 years ago

@garfinjm We released TSV exports in 0.3.7 just now.

garfinjm commented 4 years ago

Thanks @ivan-aksamentov and @rneher! :thumbsup: